Cookies in CustomHeaders not correctly used & building altered headers (Hybrid) #930
Labels
Status: Completed
Type: Bug
Katana version: v1.1.0
Current Behavior:
If cookies are added via customHeaders, they are not loaded into the browser itself. They are sent with the first request, but because they never reach the browser's cookie store, the information can be lost for the rest of the crawl. This is very restrictive when performing an authenticated crawl whose authentication vector is a cookie.
What's more, the request that is reconstructed for the output is built solely from the customHeaders, not from the headers of the request actually sent by the browser. As a result, the request written to the output does not really correspond to the request the browser sent.
Detailed explanation of bug source
If we want to add custom cookies for an authenticated crawl, we need to use the -H option (CustomHeaders). This data is added to the Headers field of Shared.
Custom headers are then used when crawling a web page: they are added to the headers of the page in question by the Shared addHeadersToPage function. This function calls page.SetExtraHeaders, which is where the bug comes from.
During crawling, a visited page may return a Set-Cookie in its response, which initializes a cookie in the browser. From then on, even if custom cookies are specified in the option, they are not applied to the page headers, because page.SetExtraHeaders only adds a value when one does not already exist. For a crawl authenticated via a cookie, that value can therefore be lost during the crawl.
For example, during the first request the foo=bar cookie is present, but on subsequent requests this information has disappeared because the browser has initialized its own cookies, so SetExtraHeaders no longer adds the Cookie header since a value is already set.
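For reference, the header-injection path looks roughly like the sketch below. The function body is an approximation of what the hybrid engine does with the -H values, not the actual Katana source; page.SetExtraHeaders is go-rod's API.

```go
package hybrid

import (
	"github.com/go-rod/rod"
)

// addHeadersToPage approximates how the -H / CustomHeaders map is pushed onto a
// page in hybrid mode: the map is flattened into key/value pairs and registered
// as extra headers via go-rod.
func addHeadersToPage(page *rod.Page, headers map[string]string) error {
	if len(headers) == 0 {
		return nil
	}
	var kv []string
	for key, value := range headers {
		kv = append(kv, key, value)
	}
	// Per the behaviour described above, the extra "Cookie" value is only
	// effective while the browser has no cookies of its own; once a Set-Cookie
	// response initializes the browser's cookie store, the custom cookie from
	// -H is no longer applied to outgoing requests.
	_, err := page.SetExtraHeaders(kv)
	return err
}
```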
In addition, the headers used to recreate the output request do not coincide with the real request sent by the browser. During crawling, the browser can dynamically add headers and cookies, but the reconstruction is based solely on the custom headers entered as input.
Genuine request:
Output request:
Expected Behavior:
Create an option to load cookies when the browser is initialized. In hybrid mode, cookies can't simply be added to headers; they have to be inserted into the browser itself to emulate real browser behavior. Cookies and headers must therefore be handled separately in this context.
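A minimal sketch of what such an option could do, assuming go-rod's cookie API; the helper name, its parameters, and the cookie-string parsing are hypothetical, not an existing Katana function:

```go
package hybrid

import (
	"strings"

	"github.com/go-rod/rod"
	"github.com/go-rod/rod/lib/proto"
)

// loadCookiesIntoBrowser (hypothetical helper) takes the raw value of a
// "Cookie" custom header, e.g. "session=abc; foo=bar", and installs each pair
// in the browser's cookie store for the target domain, so the cookies persist
// for the whole crawl instead of living only in extra request headers.
func loadCookiesIntoBrowser(browser *rod.Browser, rawCookie, domain string) error {
	var params []*proto.NetworkCookieParam
	for _, pair := range strings.Split(rawCookie, ";") {
		name, value, found := strings.Cut(strings.TrimSpace(pair), "=")
		if !found || name == "" {
			continue
		}
		params = append(params, &proto.NetworkCookieParam{
			Name:   name,
			Value:  value,
			Domain: domain,
			Path:   "/",
		})
	}
	if len(params) == 0 {
		return nil
	}
	return browser.SetCookies(params)
}
```

With the cookies in the browser's own store, later Set-Cookie responses simply update that store instead of silently shadowing the custom Cookie header.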
To rebuild the request headers for the output, simply use the headers attached to the hijacked request (e proto....). Those already contain all the information, including the customHeaders.
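A hedged sketch of that approach using go-rod's hijack router; the helper and callback names are made up for illustration and only show where the real headers could be read from:

```go
package hybrid

import (
	"net/http"

	"github.com/go-rod/rod"
	"github.com/go-rod/rod/lib/proto"
)

// captureRealHeaders (illustrative only, not Katana code) reads the headers of
// the request the browser actually sends, instead of rebuilding them from the
// -H custom headers alone.
func captureRealHeaders(page *rod.Page, onRequest func(method, url string, headers http.Header)) *rod.HijackRouter {
	router := page.HijackRequests()
	router.MustAdd("*", func(ctx *rod.Hijack) {
		req := ctx.Request.Req()
		// req.Header already contains everything the browser added dynamically,
		// including cookies from its own store and the custom headers.
		onRequest(req.Method, req.URL.String(), req.Header)
		// Let the request continue unchanged.
		ctx.ContinueRequest(&proto.FetchContinueRequest{})
	})
	go router.Run()
	return router
}
```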
Steps To Reproduce: