Commit

doc: correct internal page links (#470)
Specifically, to the cleaning content and using transform sections.
jfix authored and mtashley committed Aug 16, 2019
1 parent 398cba4 commit 76d59f2
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions src/extractors/custom/README.md
@@ -334,15 +334,15 @@ You can refer to the [NewYorkerExtractor](www.newyorker.com/index.js) to see mor
### Step 4: Content extraction
-I've left content extraction for last, since it's often the trickiest, sometimes requiring special passes to [clean](#cleaning-content) and [transform](#using-tranforms) the content. For the New Yorker, the first part is easy: The selector for this page is clearly `div#articleBody`. But that's just our first step, because unlike the other tests, where we want to make sure we're matching a simple string, we need to sanity check that the page looks good when it's rendered, and that there aren't any elements returned by our selector that we don't want.
+I've left content extraction for last, since it's often the trickiest, sometimes requiring special passes to [clean](#cleaning-content-from-an-article) and [transform](#using-transforms) the content. For the New Yorker, the first part is easy: The selector for this page is clearly `div#articleBody`. But that's just our first step, because unlike the other tests, where we want to make sure we're matching a simple string, we need to sanity check that the page looks good when it's rendered, and that there aren't any elements returned by our selector that we don't want.
To aid you in previewing the results, you can run the `./preview` script to see what the title and content output look like. So, after you've chosen your selector, run the preview script on the URL you're testing:
```bash
./preview http://www.newyorker.com/tech/elements/hacking-cryptography-and-the-countdown-to-quantum-computing
```
-This script will open both an `html` and `json` file allowing you to preview your results. Luckily for us, the New Yorker content is simple, and doesn't require any unusual cleaning or transformations — at least not in this example. Remember that if you do see content that needs cleaned or transformed in the selected content, you can follow the instructions in the [clean](#cleaning-content) and [transform](#using-tranforms) sections above.
+This script will open both an `html` and `json` file allowing you to preview your results. Luckily for us, the New Yorker content is simple, and doesn't require any unusual cleaning or transformations — at least not in this example. Remember that if you do see content that needs cleaned or transformed in the selected content, you can follow the instructions in the [clean](#cleaning-content-from-an-article) and [transform](#using-transforms) sections above.
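
For readers of this diff who don't have the rest of the README open, here is a rough sketch of the `content` block that the changed paragraphs describe. Only the `div#articleBody` selector and the domain come from the text above; the `clean` and `transforms` entries use made-up class names purely to show where such rules would go, and the object name is hypothetical. The README says the New Yorker page needs neither in this example.

```javascript
// Hypothetical sketch of the content portion of a custom extractor.
// Only 'div#articleBody' is taken from the README text; the clean and
// transforms entries below are illustrative assumptions.
const NewYorkerExtractorSketch = {
  domain: 'www.newyorker.com',

  content: {
    // Selector identified in Step 4 for the article body.
    selectors: ['div#articleBody'],

    // Assumed example: selectors for elements to strip from the
    // matched content (see the "clean" section linked above).
    clean: ['.hypothetical-related-links'],

    // Assumed example: convert matching nodes to another tag
    // (see the "transform" section linked above).
    transforms: {
      'h1.hypothetical-subhead': 'h2',
    },
  },
};

// Print the block so it can be compared against the ./preview output by eye.
console.log(JSON.stringify(NewYorkerExtractorSketch.content, null, 2));
```

In the repository itself an extractor like this would be exported from the site's `index.js`; the plain `const` here just keeps the sketch runnable on its own.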
## Submitting a custom extractor