Patch for #269: Replace backslash with forward slash in URL #276
This PR improves ExtractDomain by replacing backslashes with forward slashes in a URL before passing it to the Java URL class.
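To illustrate the idea, here is a minimal Java sketch of the normalization step. It is not the actual ExtractDomain code: the class name `BackslashUrlSketch`, the helper `extractDomain`, the empty-string fallback, and the `http://` scheme added to the example URL are all assumptions made for illustration.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical sketch; names and error handling are assumptions,
// not the toolkit's real implementation.
class BackslashUrlSketch {

    // new URL(String) is deprecated on recent JDKs but still available.
    @SuppressWarnings("deprecation")
    static String extractDomain(String url) {
        // Replace every backslash with a forward slash so that
        // java.net.URL sees '/' as the end of the host part.
        String normalized = url.replace('\\', '/');
        try {
            String host = new URL(normalized).getHost();
            return host == null ? "" : host;
        } catch (MalformedURLException e) {
            // Assumed fallback: return an empty string for unparsable input.
            return "";
        }
    }

    public static void main(String[] args) {
        // The "http://" scheme is assumed; the PR shows the URL without one.
        String raw = "http://seetorontonow.canada-booknow.com\\booking_results.php";
        System.out.println(extractDomain(raw));
        // prints "seetorontonow.canada-booknow.com"
    }
}
```

Normalizing before parsing keeps the fix local to the input string, so the rest of the domain extraction logic does not need to change.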
GitHub issue(s):
What does this Pull Request do?
This PR improves URL parsing in ExtractDomain by replacing backslashes with forward slashes before passing the URL to the Java URL class, allowing ExtractDomain to capture the true domain of a URL.
How should this be tested?
git fetch --all
git checkout fix-url
mvn clean install
mkdir -p path/to/where/ever/you/can/write/output/all-text path/to/where/ever/you/can/write/output/all-domains path/to/where/ever/you/can/write/output/gephi path/to/where/ever/you/can/write/spark-jobs
Current Results
With this PR patch:
/tuna1/scratch/aut-issue-269/derivatives/all-domains or /tuna1/scratch/aut-issue-269/derivatives/all-domains.txt (a combined version of all files in /tuna1/scratch/aut-issue-269/derivatives/all-domains)
/tuna1/scratch/aut-issue-269/derivatives/gephi (no longer contains backslashes; the proper domain seetorontonow.canada-booknow.com has been extracted from the URL)
Without this PR patch (master branch):
/tuna1/scratch/aut-issue-269/derivatives/all-domains-without-patch or /tuna1/scratch/aut-issue-269/derivatives/all-domains-without-patch.txt (combined version)
/tuna1/scratch/aut-issue-269/derivatives/gephi-without-patch (contains a backslash, as in the URL seetorontonow.canada-booknow.com\booking_results.php)
Interested parties
@lintool @ianmilligan1 @ruebot @greebie