Declarative DOM extraction expression evaluator.
Powerful, succinct, composable, extendable, declarative API.
articles:
- select article {0,}
- body:
  - select .body
  - read property innerHTML
  imageUrl:
  - select img
  - read attribute src
  summary:
  - select ".body p:first-child"
  - read property innerHTML
  - format text
  title:
  - select .title
  - read property textContent
pageName:
- select .body
- read property innerHTML
Not succinct enough for you? Use aliases and the pipe operator (|) to shorten and concatenate the commands:
articles:
- sm article
- body: s .body | rp innerHTML
  imageUrl: s img | ra src
  summary: s .body p:first-child | rp innerHTML | f text
  title: s .title | rp textContent
pageName: s .body | rp innerHTML
Have you got suggestions for improvement? I am all ears.
- Configuration
- Evaluators
- Subroutines
- Expression reference
- Cookbook
- Error handling
- Debugging
Name | Type | Description | Default value |
---|---|---|---|
evaluator | EvaluatorType | HTML parser and selector engine. See evaluators. | browser evaluator if window and document variables are present, cheerio otherwise. |
subroutines | $PropertyType<UserConfigurationType, 'subroutines'> | User defined subroutines. See subroutines. | N/A |
Subroutines use an evaluator to parse input (i.e. convert a string into an object) and to select nodes in the resulting document.
The default evaluator is configured based on the user environment:
The browser evaluator is used if window and document variables are defined; otherwise, cheerio is used.
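The selection rule above can be sketched as a small function. This is only an illustration of the described behaviour, not Surgeon's source: pickDefaultEvaluatorName is a hypothetical helper, and the scope parameter is an assumption added so the rule can be tried outside a browser.

```javascript
// Hypothetical helper (not part of Surgeon's API). `scope` stands in for the
// global object so that the rule can be exercised in any environment.
const pickDefaultEvaluatorName = (scope = globalThis) => {
  const hasWindow = typeof scope.window !== 'undefined';
  const hasDocument = typeof scope.document !== 'undefined';

  // Both globals must be present for the browser evaluator to be chosen.
  return hasWindow && hasDocument ? 'browser' : 'cheerio';
};

pickDefaultEvaluatorName({window: {}, document: {}}); // 'browser'
pickDefaultEvaluatorName({}); // 'cheerio'
```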
Have a use case for another evaluator? Raise an issue.
For an example implementation of an evaluator, refer to:
Uses native browser methods to parse the document and to evaluate CSS selector queries.
Use the browser evaluator if you are running Surgeon in a browser or a headless browser (e.g. PhantomJS).
import {
  browserEvaluator
} from './evaluators';
surgeon({
  evaluator: browserEvaluator()
});
Uses cheerio to parse the document and to evaluate CSS selector queries.
Use the cheerio evaluator if you are running Surgeon in Node.js.
import {
  cheerioEvaluator
} from './evaluators';
surgeon({
  evaluator: cheerioEvaluator()
});
A subroutine is a function used to advance the DOM extraction expression evaluator, e.g.
x('foo | bar baz', 'qux');
In the above example, the Surgeon expression uses two subroutines: foo and bar.
The foo subroutine is invoked without additional values. The bar subroutine is executed with 1 value ("baz").
Subroutines are executed in the order in which they are defined; the result of each subroutine is passed on to the next one. The first subroutine receives the document input (in this case: the "qux" string).
Multiple subroutines can be written as an array. The following example is equivalent to the earlier example.
x([
  'foo',
  'bar baz'
], 'qux');
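To make the execution order concrete, here is a minimal, self-contained sketch of a runner that threads a value through a list of subroutines. It is not Surgeon's implementation: the run function and the bodies of foo and bar are invented for illustration, and quoting rules and nested queries are ignored.

```javascript
// Stand-in subroutines. The names come from the example above; the bodies are
// assumptions made up for this sketch.
const subroutines = {
  foo: (subject) => subject.toUpperCase(),
  bar: (subject, [suffix]) => subject + '-' + suffix
};

// Simplified runner: accepts either a piped string or an array of steps.
const run = (expression, input) => {
  const steps = Array.isArray(expression) ? expression : expression.split('|');

  return steps.reduce((subject, step) => {
    const [name, ...values] = step.trim().split(/\s+/);

    if (!Object.prototype.hasOwnProperty.call(subroutines, name)) {
      throw new Error('Unknown subroutine: ' + name);
    }

    // The result of this step becomes the subject of the next one.
    return subroutines[name](subject, values);
  }, input);
};

run('foo | bar baz', 'qux'); // 'QUX-baz'
run(['foo', 'bar baz'], 'qux'); // 'QUX-baz'
```

Both calls produce the same value, which mirrors the equivalence of the piped and the array forms.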
There are two types of subroutines:
- Built-in subroutines
- User-defined subroutines
Note:
These functions are called subroutines to emphasise the cross-platform nature of the declarative API.
The following subroutines are available out of the box.
append appends a string to the input string.
Parameter name | Description | Default |
---|---|---|
tail | Appends a string to the end of the input string. | N/A |
Examples:
// Assuming an element <a href='http://foo' />,
// then the result is 'http://foo/bar'.
x(`select a | read attribute href | append '/bar'`);
closest subroutine iterates through all the preceding nodes (including parent nodes) searching for either a preceding node matching the selector expression or a descendant of the preceding node matching the selector.
Note: This is different from the jQuery .closest() in that the latter method does not search for parent descendants matching the selector.
Parameter name | Description | Default |
---|---|---|
CSS selector | CSS selector used to select an element. | N/A |
constant returns the parameter value regardless of the input.
Parameter name | Description | Default |
---|---|---|
constant | Constant value that will be returned as the result. | N/A |
format is used to format input using sprintf.
Parameter name | Description | Default |
---|---|---|
format | sprintf format used to format the input string. The subroutine input is the first argument, i.e. %1$s. | %1$s |
Examples:
// Reads the "href" attribute value of the matching element.
// Prefixes the value with 'http://foo.com'.
x(`select a | read attribute href | format 'http://foo.com%1$s'`);
match is used to extract matching capturing groups from the subject input.
Parameter name | Description | Default |
---|---|---|
Regular expression | Regular expression used to match capturing groups in the string. | N/A |
Sprintf format | sprintf format used to construct a string using the matching capturing groups. | %s |
Examples:
// Extracts 1 matching capturing group from the input string.
// Throws `InvalidDataError` if the value does not pass the test.
x('select .foo | read property textContent | match "/input: (\d+)/"');
// Extracts 2 matching capturing groups from the input string and formats the output using sprintf.
// Throws `InvalidDataError` if the value does not pass the test.
x('select .foo | read property textContent | match "/input: (\d+)-(\d+)/" %2$s-%1$s');
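The interaction between capturing groups and the sprintf format can be approximated in plain JavaScript. The sketch below is an assumption-laden stand-in, not the library's code: matchAndFormat is a hypothetical name, it supports only %s and positional %N$s placeholders, and it throws a plain Error where Surgeon reports InvalidDataError.

```javascript
// Hypothetical stand-in for the `match` subroutine.
const matchAndFormat = (input, pattern, format = '%s') => {
  // Drop the full match and keep only the capturing groups.
  const groups = (input.match(pattern) || []).slice(1);

  if (groups.length === 0) {
    throw new Error('No capturing groups matched.');
  }

  let next = 0;

  // `%2$s` picks group 2; a bare `%s` consumes groups in order.
  return format.replace(/%(?:(\d+)\$)?s/g, (placeholder, position) => {
    const index = position ? Number(position) - 1 : next++;

    return groups[index];
  });
};

matchAndFormat('input: 12', /input: (\d+)/); // '12'
matchAndFormat('input: 12-34', /input: (\d+)-(\d+)/, '%2$s-%1$s'); // '34-12'
```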
nextUntil subroutine is used to select all following siblings of each element up to but not including the element matched by the selector.
Parameter name | Description | Default |
---|---|---|
selector expression | A string containing a selector expression to indicate where to stop matching following sibling elements. | N/A |
filter expression | A string containing a selector expression to match elements against. |
prepend prepends a string to the input string.
Parameter name | Description | Default |
---|---|---|
head | Prepends a string to the start of the input string. | N/A |
Examples:
// Assuming an element <a href='//foo' />,
// then the result is 'http://foo'.
x(`select a | read attribute href | prepend 'http:'`);
previous subroutine selects the preceding sibling.
Parameter name | Description | Default |
---|---|---|
CSS selector | CSS selector used to select an element. | N/A |
Example:
<ul>
  <li>foo</li>
  <li class='bar'></li>
</ul>
x('select .bar | previous | read property textContent');
// 'foo'
read is used to extract a value from the matching element using an evaluator.
Parameter name | Description | Default |
---|---|---|
Target type | Possible values: "attribute" or "property" | N/A |
Target name | Depending on the target type, name of an attribute or a property. | N/A |
Examples:
// Returns .foo element "href" attribute value.
// Throws error if attribute does not exist.
x('select .foo | read attribute href');
// Returns an array of "href" attribute values of the matching elements.
// Throws error if attribute does not exist on either of the matching elements.
x('select .foo {0,} | read attribute href');
// Returns .foo element "textContent" property value.
// Throws error if property does not exist.
x('select .foo | read property textContent');
remove subroutine is used to remove elements from the document using an evaluator.
remove subroutine accepts the same parameters as the select subroutine.
The result of the remove subroutine is the input of the subroutine, i.e. the previous select subroutine result.
Parameter name | Description | Default |
---|---|---|
CSS selector | CSS selector used to select an element. | N/A |
Quantifier expression | A quantifier expression is used to control the expected result length. | See quantifier expression. |
Examples:
// Returns 'bar'.
x('select .foo | remove span | read property textContent', `<div class='foo'>bar<span>baz</span></div>`);
select subroutine is used to select the elements in the document using an evaluator.
Parameter name | Description | Default |
---|---|---|
CSS selector | CSS selector used to select an element. | N/A |
Quantifier expression | A quantifier expression is used to control the shape of the results (direct result or array of results) and the expected result length. | See quantifier expression. |
A quantifier expression is used to assert that the query matches a set number of nodes. A quantifier expression is a modifier of the select subroutine.
A quantifier expression is defined using the following syntax.
Name | Syntax |
---|---|
Fixed quantifier | {n} where n is an integer >= 1 |
Greedy quantifier | {n,m} where n >= 0 and m >= n |
Greedy quantifier | {n,} where n >= 0 |
Greedy quantifier | {,m} where m >= 1 |
A quantifier expression can be followed by a node selector [i], e.g. {0,}[0]. This returns a single node from the result set (here, the first one) instead of an array.
If this looks familiar, it's because I have adopted the syntax from the regular expression language. However, unlike in regular expressions, a quantifier in the context of a Surgeon selector produces an error (SelectSubroutineUnexpectedResultCountError) if the selector result length is out of the quantifier range.
Examples:
// Selects 0 or more nodes.
// Result is an array.
x('select .foo {0,}');
// Selects 1 or more nodes.
// Throws an error if 0 matches found.
// Result is an array.
x('select .foo {1,}');
// Selects between 0 and 5 nodes.
// Throws an error if more than 5 matches found.
// Result is an array.
x('select .foo {0,5}');
// Selects 1 node.
// Result is the first match in the result set (or `null`).
x('select .foo {0,}[0]');
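The quantifier rules can be sketched without a DOM by applying them to a plain array. The code below is an illustration under assumptions: parseQuantifier and applyQuantifier are hypothetical names, a plain Error stands in for SelectSubroutineUnexpectedResultCountError, and the sketch always returns an array when no node selector is given, which may differ from how Surgeon shapes its results.

```javascript
// Hypothetical parser for {n}, {n,m}, {n,} and {,m}, with an optional [i].
const parseQuantifier = (expression) => {
  const match = /^\{(\d*)(,?)(\d*)\}(?:\[(\d+)\])?$/.exec(expression);

  if (!match || (match[1] === '' && match[3] === '')) {
    throw new Error('Invalid quantifier expression: ' + expression);
  }

  const [, n, comma, m, index] = match;
  const min = n === '' ? 0 : Number(n);
  let max = Infinity; // {n,}: no upper bound

  if (comma === '') {
    max = min; // {n}: fixed count
  } else if (m !== '') {
    max = Number(m); // {n,m} or {,m}
  }

  if (max < min) {
    throw new Error('Invalid quantifier range: ' + expression);
  }

  return {index: index === undefined ? null : Number(index), max, min};
};

// Checks the result length, then optionally picks a single node.
const applyQuantifier = (nodes, expression) => {
  const {index, max, min} = parseQuantifier(expression);

  if (nodes.length < min || nodes.length > max) {
    throw new Error('Unexpected result count: ' + nodes.length);
  }

  if (index === null) {
    return nodes;
  }

  return index < nodes.length ? nodes[index] : null;
};

applyQuantifier(['a', 'b'], '{0,}'); // ['a', 'b']
applyQuantifier(['a', 'b'], '{0,}[0]'); // 'a'
applyQuantifier([], '{0,}[0]'); // null
```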
test is used to validate the current value using a regular expression.
Parameter name | Description | Default |
---|---|---|
Regular expression | Regular expression used to test the value. | N/A |
Examples:
// Validates that .foo element textContent property value matches /bar/ regular expression.
// Throws `InvalidDataError` if the value does not pass the test.
x('select .foo | read property textContent | test /bar/');
See error handling for more information and usage examples of the test subroutine.
Custom subroutines can be defined using the subroutines configuration.
A subroutine is a function. A subroutine function is invoked with the following parameters:
Parameter name |
---|
An instance of [Evaluator]. |
Current value, i.e. value used to query Surgeon or value returned from the previous (or ancestor) subroutine. |
An array of values used when referencing the subroutine in an expression. |
Example:
const x = surgeon({
  subroutines: {
    mySubroutine: (currentValue, [firstParameterValue, secondParameterValue]) => {
      console.log(currentValue, firstParameterValue, secondParameterValue);
      return parseInt(currentValue, 10) + 1;
    }
  }
});
x('mySubroutine foo bar | mySubroutine baz qux', 0);
The above example prints:
0 "foo" "bar"
1 "baz" "qux"
For more examples of defining subroutines, refer to:
- Validate the results using a user-defined test function.
- Source code of the built-in subroutines.
Custom subroutines can be inlined into pianola instructions, e.g.
x(
  [
    'foo',
    (subject) => {
      // `subject` is the return value of `foo` subroutine.
      return 'bar';
    },
    'baz',
  ],
  'qux'
);
Surgeon exports an alias preset that reduces the verbosity of queries.
Name | Description |
---|---|
ra ... | Reads Element attribute value. Equivalent to read attribute ... |
rdtc ... | Removes any descending elements and reads the resulting textContent property of an element. Equivalent to `remove * {0,} |
rih ... | Reads innerHTML property of an element. Equivalent to read property ... innerHTML |
roh ... | Reads outerHTML property of an element. Equivalent to read property ... outerHTML |
rp ... | Reads Element property value. Equivalent to read property ... |
rtc ... | Reads textContent property of an element. Equivalent to read property ... textContent |
sa ... | Select any (sa). Selects multiple elements (0 or more). Returns array. Equivalent to select "..." {0,} |
saf ... | Select any first (saf). Selects multiple elements (0 or more). Returns single result or null. Equivalent to select "..." {0,}[0] |
sm ... | Select many (sm). Selects multiple elements (1 or more). Returns array. Equivalent to select "..." {1,} |
smo ... | Select maybe one (smo). Selects one element. Returns single result or null. Equivalent to select "..." {0,1}[0] |
so ... | Select one (so). Selects a single element. Returns single result. Equivalent to select "..." {1}[0]. |
t {name} | Tests value. Equivalent to test ... |
Note regarding the s ... alias. The CSS selector value is quoted. Therefore, you can write a CSS selector that includes spaces without putting the value in quotes, e.g. s .foo .bar is equivalent to select ".foo .bar" {1}.
Other alias values are not quoted. Therefore, if a value includes a space, it must be quoted, e.g. t "/foo bar/".
Usage:
import surgeon, {
  subroutineAliasPreset
} from 'surgeon';
const x = surgeon({
  subroutines: {
    ...subroutineAliasPreset
  }
});
x('s .foo .bar | t "/foo bar/"');
In addition to the built-in aliases, users can declare their own subroutine aliases.
Surgeon subroutines are referenced using expressions.
An expression is defined using the following pseudo-grammar:
subroutines ->
    subroutines _ "|" _ subroutine
  | subroutine
subroutine ->
    subroutineName " " parameters
  | subroutineName
subroutineName ->
  [a-zA-Z0-9\-_]:+
parameters ->
    parameters " " parameter
  | parameter
Example:
x('foo bar baz', 'qux');
In this example, the Surgeon query executor (x) is invoked with the foo bar baz expression and the qux starting value. The expression tells the query executor to run the foo subroutine with parameter values "bar" and "baz" and subject value "qux".
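As a rough illustration of how an expression breaks down into a subroutine name and parameter values, the sketch below tokenizes an expression string. It only approximates the pseudo-grammar above: parseExpression is a hypothetical helper, quoted values are treated as single parameters, and a pipe character inside a quoted value is not handled.

```javascript
// Hypothetical tokenizer; the pseudo-grammar above remains the authority.
const parseExpression = (expression) => {
  return expression.split('|').map((part) => {
    // A token is a double-quoted run, a single-quoted run, or a run of non-spaces.
    const tokens = part.match(/"[^"]*"|'[^']*'|\S+/g) || [];
    const [subroutine, ...values] = tokens.map((token) => {
      // Strip one pair of matching surrounding quotes, if present.
      return /^(["']).*\1$/.test(token) ? token.slice(1, -1) : token;
    });

    return {subroutine, values};
  });
};

parseExpression('foo bar baz');
// [{subroutine: 'foo', values: ['bar', 'baz']}]

parseExpression('select ".foo .bar" {0,} | read property textContent');
// [
//   {subroutine: 'select', values: ['.foo .bar', '{0,}']},
//   {subroutine: 'read', values: ['property', 'textContent']}
// ]
```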
Multiple subroutines can be combined using an array:
x([
  'foo bar baz',
  'corge grault garply'
], 'qux');
In this example, the Surgeon query executor (x) is invoked with two expressions (foo bar baz and corge grault garply). The first subroutine is executed with the subject value "qux". The second subroutine is executed with the result of the first subroutine.
The result of the query is the result of the last subroutine.
Read the user-defined subroutines documentation for a broader explanation of the role of the parameter values and the subject value.
Multiple subroutines can be combined using the pipe operator.
The following examples are equivalent:
x([
  'foo bar baz',
  'qux quux quuz'
]);
x([
  'foo bar baz | qux quux quuz'
]);
x('foo bar baz | qux quux quuz');
Unless redefined, all examples assume the following initialisation:
import surgeon from 'surgeon';
/**
* @param configuration {@see https://github.com/gajus/surgeon#configuration}
*/
const x = surgeon();
Use the select subroutine and the read subroutine to extract a single value.
const subject = `
<div class="title">foo</div>
`;
x('select .title | read property textContent', subject);
// 'foo'
Specify the select subroutine quantifier to match multiple results.
const subject = `
  <div class="foo">bar</div>
  <div class="foo">baz</div>
  <div class="foo">qux</div>
`;
x('select .foo {0,} | read property textContent', subject);
// [
//   'bar',
//   'baz',
//   'qux'
// ]
Use a QueryChildrenType object to name the results of the descending expressions.
const subject = `
  <article>
    <div class='title'>foo title</div>
    <div class='body'>foo body</div>
  </article>
  <article>
    <div class='title'>bar title</div>
    <div class='body'>bar body</div>
  </article>
`;
x([
  'select article {0,}',
  {
    body: 'select .body | read property textContent',
    title: 'select .title | read property textContent'
  }
], subject);
// [
//   {
//     body: 'foo body',
//     title: 'foo title'
//   },
//   {
//     body: 'bar body',
//     title: 'bar title'
//   }
// ]
Use the test subroutine to validate the results.
const subject = `
  <div class="foo">bar</div>
  <div class="foo">baz</div>
  <div class="foo">qux</div>
`;
x('select .foo {0,} | read property textContent | test /^[a-z]{3}$/', subject);
See error handling for information on how to handle test subroutine errors.
Define a custom subroutine to validate results using arbitrary logic.
Use InvalidValueSentinel to leverage the standardised Surgeon error handler (see error handling). Otherwise, simply throw an error.
import surgeon, {
  InvalidValueSentinel
} from 'surgeon';
const x = surgeon({
  subroutines: {
    isRed: (value) => {
      if (value === 'red') {
        return value;
      }
      return new InvalidValueSentinel('Unexpected color.');
    }
  }
});
As you become familiar with the query execution mechanism, typing long expressions (such as select, read attribute and read property) becomes a mundane task.
Remember that subroutines are regular functions: you can partially apply and use the partially applied functions to create new subroutines.
Example:
import surgeon, {
  readSubroutine,
  selectSubroutine,
  testSubroutine
} from 'surgeon';
const x = surgeon({
  subroutines: {
    ra: (subject, values, bindle) => {
      return readSubroutine(subject, ['attribute'].concat(values), bindle);
    },
    rp: (subject, values, bindle) => {
      return readSubroutine(subject, ['property'].concat(values), bindle);
    },
    s: (subject, values, bindle) => {
      return selectSubroutine(subject, [values.join(' '), '{1}'], bindle);
    },
    sm: (subject, values, bindle) => {
      return selectSubroutine(subject, [values.join(' '), '{0,}'], bindle);
    },
    t: testSubroutine
  }
});
Now, instead of writing:
articles:
- select article {0,}
- body:
  - select .body
  - read property innerHTML
You can write:
articles:
- sm article
- body:
  - s .body
  - rp innerHTML
The aliases used in this example are available in the aliases preset (read built-in subroutine aliases).
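The partial application idea can be tried in isolation with a stand-in subroutine. Everything in this sketch is an assumption made for illustration: read operates on a plain object rather than a DOM node, and partial is a hypothetical helper, not something exported by Surgeon.

```javascript
// Stand-in for a read-style subroutine; `subject` is a plain object here.
const read = (subject, [targetType, targetName]) => {
  const source = targetType === 'attribute' ? subject.attributes : subject.properties;

  if (!(targetName in source)) {
    throw new Error('Cannot read ' + targetType + ' ' + targetName);
  }

  return source[targetName];
};

// Fixes the leading parameter values; the remaining values come from the expression.
const partial = (subroutine, ...head) => {
  return (subject, values, bindle) => {
    return subroutine(subject, head.concat(values), bindle);
  };
};

const ra = partial(read, 'attribute');
const rp = partial(read, 'property');

const element = {
  attributes: {href: '/foo'},
  properties: {textContent: 'bar'}
};

ra(element, ['href']); // '/foo'
rp(element, ['textContent']); // 'bar'
```

The same helper could wrap any subroutine that follows the (subject, values, bindle) signature used in the example above.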
Surgeon throws the following errors to indicate a predictable error state. All Surgeon errors can be imported. Use the instanceof operator to determine the error type.
Note:
Surgeon errors are non-recoverable, i.e. a selector cannot proceed if it encounters an error. This design ensures that your selectors are capturing the expected data.
Name | Description |
---|---|
ReadSubroutineNotFoundError | Thrown when an attempt is made to retrieve a non-existent attribute or property. |
SelectSubroutineUnexpectedResultCountError | Thrown when a select subroutine result length does not match the quantifier expression. |
InvalidDataError | Thrown when a subroutine returns an instance of InvalidValueSentinel. |
SurgeonError | A generic error. All other Surgeon errors extend from SurgeonError. |
Example:
import {
  InvalidDataError
} from 'surgeon';
const subject = `
  <div class="foo">bar</div>
`;
try {
  x('select .foo | read property textContent | test /bar/', subject);
} catch (error) {
  if (error instanceof InvalidDataError) {
    // Handle data validation error.
  } else {
    throw error;
  }
}
Return InvalidValueSentinel from a subroutine to force Surgeon to throw InvalidDataError.
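The sentinel mechanism can be pictured with local stand-in classes. This is a sketch under assumptions, not the library's code: the classes below merely reuse the names of the Surgeon exports, and runSubroutine is a hypothetical wrapper showing how a returned sentinel could be converted into a thrown error.

```javascript
// Local re-creations named after the Surgeon exports, for illustration only.
class SurgeonError extends Error {}
class InvalidDataError extends SurgeonError {}

class InvalidValueSentinel {
  constructor (message) {
    this.message = message;
  }
}

// Hypothetical wrapper: a returned sentinel becomes a thrown InvalidDataError.
const runSubroutine = (subroutine, subject, values = []) => {
  const result = subroutine(subject, values);

  if (result instanceof InvalidValueSentinel) {
    throw new InvalidDataError(result.message);
  }

  return result;
};

const isRed = (value) => {
  return value === 'red' ? value : new InvalidValueSentinel('Unexpected color.');
};

try {
  runSubroutine(isRed, 'blue');
} catch (error) {
  if (!(error instanceof InvalidDataError)) {
    throw error;
  }
  // Handle the data validation error here.
}
```

Because InvalidDataError extends SurgeonError in this sketch, a single instanceof SurgeonError check would also catch it, which matches the hierarchy described in the table above.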
Surgeon uses roarr to log debugging information.
Export the ROARR_LOG=TRUE environment variable to enable the Surgeon debug log.