Filebeat utf-8 encoding doesn’t honor BOM #1349

Closed
prehor opened this issue Apr 7, 2016 · 17 comments

@prehor

prehor commented Apr 7, 2016

Windows tools often prepend a UTF-8 BOM to text files, which is legal (see "Can a UTF-8 data stream contain the BOM character (in UTF-8 form)?"). In my case it happens in Exchange Message Tracking logs.

Filebeat should strip the BOM from files read with UTF-8 encoding, but it doesn't, so the BOM appears in the message field for the first line of a file:

{
  "@timestamp":"2016-04-07T11:58:36.922Z",
  "beat":{
    "hostname":"XXXXX",
    "name":"XXXXX"},
  "count":1,
  "fields":null,
  "input_type":"log",
  "message":"<U+FEFF>#Software: Microsoft Exchange Server",
  "offset":0,
  "source":"exchange/MSGTRK20160405-1.LOG",
  "type":"exchange"
}

Filebeat config:

filebeat:
  prospectors:
    -
      document_type: exchange
      input_type: log
      paths:
        - exchange/MSGTRK2*.LOG
      encoding: utf-8

output:
  file:
    path: logstash/output
    name: exchange

Hex dump of first line in file MSGTRK20160405-1.LOG:

00000000  ef bb bf 23 53 6f 66 74  77 61 72 65 3a 20 4d 69  |...#Software: Mi|
00000010  63 72 6f 73 6f 66 74 20  45 78 63 68 61 6e 67 65  |crosoft Exchange|
00000020  20 53 65 72 76 65 72 0d  0a                       | Server..|

The first three bytes, EF BB BF (the UTF-8 encoded BOM), are decoded to the Unicode character U+FEFF, which then appears in the message field.
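
To make the requested behavior concrete, here is a minimal Go sketch of stripping a leading UTF-8 BOM from the first line of a file. The helper name stripUTF8BOM is hypothetical and this is not Filebeat code; it only assumes the first line is available as a byte slice.

```go
package main

import (
	"bytes"
	"fmt"
)

// utf8BOM is the UTF-8 encoding of U+FEFF: the bytes EF BB BF.
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// stripUTF8BOM is a hypothetical helper (not part of Filebeat). It removes
// a single leading UTF-8 BOM if one is present and leaves everything else
// untouched.
func stripUTF8BOM(line []byte) []byte {
	return bytes.TrimPrefix(line, utf8BOM)
}

func main() {
	// Same leading bytes as in the hex dump above.
	first := []byte("\xef\xbb\xbf#Software: Microsoft Exchange Server")
	fmt.Printf("%q\n", stripUTF8BOM(first))
}
```

Only the first line at offset 0 would need this treatment; applying it to later lines is harmless but unnecessary.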

I'm using filebeat-1.2.0-darwin.

@StianOvrevage

Same error with Winlogbeat:

C:\communitor\winlogbeat>winlogbeat.exe -configtest
Loading config file error: YAML config parsing failed on C:\winlogbeat\winlogbeat.yml: yaml: invalid leading UTF-8 octet. Exiting.

Ironically, this makes it impossible to edit the Winlogbeat config ON Windows.

@ruflin
Contributor

ruflin commented Aug 19, 2016

@prehor Sorry for the really late answer. Thanks a lot for providing all the details. The above implementation was kind of blocked by an update to our encoding library (#2089). This is now happening and I hope to be able to have a look at it soonish.

@ruflin
Contributor

ruflin commented Aug 19, 2016

@StianOvrevage Not sure if the two errors are related. The above problem is about the content of the log files. Your issue is related to the config files. Which editor did you use to modify the config file?

@StianOvrevage

I'm using the default (Notepad) editor which every Windows admin uses ;) I used Notepad++ to remove the BOM but it's not an ideal solution.

@StianOvrevage

But yes, I didn't realize that this issue was about reading the actual log files, I just assumed it was the same error I hit upon. But the problem is largely the same. A leading BOM will prevent Winlogbeat from starting.

@ruflin
Contributor

ruflin commented Aug 22, 2016

@StianOvrevage Thanks for bringing up this issue. As these two are not directly related, could you open a separate GitHub issue for further discussion?

@ruflin
Contributor

ruflin commented Aug 22, 2016

@prehor Could you provide an example log file (only a few lines) for download, just to make sure we test exactly the same file? What you describe above is an issue and we will try to find a way to fix it. Unfortunately it complicates our reading logic quite a bit.

@prehor
Author

prehor commented Aug 22, 2016

@ruflin Here is the beginning of an Exchange Message Tracking file with a UTF-8 BOM:
MSGTRK-SAMPLE.LOG.gz

@ruflin
Contributor

ruflin commented Aug 23, 2016

@prehor Based on your log file I created a PR here for discussion: #2351

ruflin added a commit to ruflin/beats that referenced this issue Sep 1, 2016
Reading a file with a bom included the bom with the first event. This change removes the bom part from the first event in case it exists.

* Tests for utf-8 and utf-16 added

Closes elastic#1349
tsg closed this as completed in #2351 Sep 2, 2016
tsg pushed a commit that referenced this issue Sep 2, 2016
Reading a file with a bom included the bom with the first event. This change removes the bom part from the first event in case it exists.

* Tests for utf-8 and utf-16 added

Closes #1349
@dannygoulder

Hi, I think I am getting the same issue when trying to parse JSON log files with Filebeat 6.2.4.

  "json": {
      "error": {
        "message": "Error decoding JSON: invalid character 'ï' looking for beginning of value",
        "type": "json"
      }
  }

Is this the same issue?

@dannygoulder

The prospector configuration allows me to select utf-16be-bom or utf-8, but not utf-8-bom.

This only occurs with the first line of each log file; the remaining lines are parsed correctly.

@ph
Contributor

ph commented Jun 11, 2018

@dannygoulder This looks oddly similar. Judging by the code and test cases, we should already strip the BOM. By default we assume UTF-8, so there is no need for a special encoding.

// Strip UTF-8 BOM if beginning of file
// As all BOMS are converted to UTF-8 it is enough to only remove this one
if h.state.Offset == 0 {
	message.Content = bytes.Trim(message.Content, "\xef\xbb\xbf")
}

I would check with a hex editor to see which character is at the beginning of the file, and I would also check the output of file -i myfile (or, on macOS, file -I myfile).

This is a hexdump of a UTF-8 file with a BOM:

00000000: efbb bf68 656c 6c6f 2077 6f72 6c64 0a    ...hello world.

Output of the file command:

ok.md: text/plain; charset=utf-8
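
The check suggested above can also be scripted, since the file command's charset output does not show whether a BOM is present. The following Go sketch inspects the leading bytes directly. The helper name detectBOM is hypothetical, and the sketch only covers the UTF-8 and UTF-16 marks (a UTF-32LE BOM, which also starts with FF FE, would be reported as utf-16le).

```go
package main

import (
	"bytes"
	"fmt"
)

// detectBOM is a hypothetical helper: it reports which byte order mark,
// if any, the leading bytes of a file start with. Only UTF-8 and UTF-16
// marks are recognized; UTF-32 is deliberately not handled.
func detectBOM(head []byte) string {
	switch {
	case bytes.HasPrefix(head, []byte{0xEF, 0xBB, 0xBF}):
		return "utf-8"
	case bytes.HasPrefix(head, []byte{0xFF, 0xFE}):
		return "utf-16le"
	case bytes.HasPrefix(head, []byte{0xFE, 0xFF}):
		return "utf-16be"
	default:
		return "none"
	}
}

func main() {
	// Same bytes as the "hello world" hexdump above.
	fmt.Println(detectBOM([]byte("\xef\xbb\xbfhello world\n")))
	// A file without any BOM.
	fmt.Println(detectBOM([]byte("hello world\n")))
}
```

In practice the first few bytes would be read from the log file itself; in-memory samples are used here to keep the sketch self-contained.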

@dannygoulder

Hi, thanks for the reply. I thought for a moment that I was getting a strange-looking BOM, but then I realised that hexdump was swapping the bytes with the default -x display (is this expected?):

$ hexdump -n 4 $FILE
0000000 bbef 7bbf
0000004
$ hexdump -C -n 4 $FILE
00000000  ef bb bf 7b                                       |...{|
00000004
$ file $FILE
myfile.log: UTF-8 Unicode (with BOM) text, with CRLF line terminators
$ file -i $FILE
myfile.log: text/plain; charset=utf-8

Any idea what I can do next?

@dannygoulder

dannygoulder commented Jun 11, 2018

Oh, and in case it matters, the file is coming from a Windows 2016 system, upon which filebeat is running. Obviously the hexdump and file commands are being run against the file after copying it to a Linux system. :)

@ph
Contributor

ph commented Jun 11, 2018

For the hexdump, I presume we don't have the same defaults (OS X vs. Linux).

0000000    bbef    68bf                                                
0000004

I am not sure what is going on here.

@dannygoulder Let's create a new issue and make sure we include the Filebeat version. Could you also create a small file that we can use to reproduce it?
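
On the swapped bytes in the default hexdump output above: a likely explanation, stated here as an assumption, is that the default display groups input into two-byte words and prints each word as a number in host byte order, which on a little-endian machine reverses each byte pair. The Go sketch below reproduces that grouping; littleEndianWords is a hypothetical helper, not part of any tool.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// littleEndianWords is a hypothetical helper that mimics the assumed
// behavior of the default hexdump display on a little-endian machine:
// it reads the input as 16-bit little-endian words and formats each as
// four hex digits. A trailing odd byte is ignored in this sketch.
func littleEndianWords(b []byte) []string {
	words := make([]string, 0, len(b)/2)
	for i := 0; i+1 < len(b); i += 2 {
		w := binary.LittleEndian.Uint16(b[i : i+2])
		words = append(words, fmt.Sprintf("%04x", w))
	}
	return words
}

func main() {
	// First four bytes of the JSON log file: EF BB BF followed by '{'.
	fmt.Println(littleEndianWords([]byte{0xEF, 0xBB, 0xBF, 0x7B})) // [bbef 7bbf]
	// First four bytes of the "hello world" sample: EF BB BF followed by 'h'.
	fmt.Println(littleEndianWords([]byte{0xEF, 0xBB, 0xBF, 0x68})) // [bbef 68bf]
}
```

Under that assumption both outputs in this thread are consistent with an ordinary UTF-8 BOM, and hexdump -C shows the bytes in file order.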

@crossan007

I'm also seeing issues on Windows Server 2019, where Filebeat fails to parse the first line of a JSON log file when the file format (as indicated by Notepad++) is utf-8-bom.

Using Filebeat 6.4.3, this is logged:

2019-08-14T16:18:44.708-0400	ERROR	json/json.go:51	Error decoding JSON: invalid character 'ï' looking for beginning of value

@acamro

acamro commented May 9, 2021

Please also check for "UCS2 LE BOM" (as Notepad++ calls it), i.e. the little-endian BOM. The first bytes are FF FE in my hex editor. This was found in SQL Server 2019 logs.

Error in filebeat:
��2\u00000\u00002\u00001\u0000-\u00000\u00003\u0000-\u00000\u00001\u0000 \u00001\u00000\u0000:\u00004\u00009\u0000:\u00001\u00006\u0000.\u00008\u00006\u0000' could not be parsed at index 0

Thanks in advance
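
The \u0000 after every character in the error above is what UTF-16LE text looks like when it is read as a single-byte or UTF-8 stream. As an illustration only, here is a Go sketch that drops a leading FF FE mark and decodes the rest as UTF-16LE using the standard library. The helper name decodeUTF16LE is hypothetical and this is not a description of how Filebeat decodes input.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unicode/utf16"
)

// decodeUTF16LE is a hypothetical helper: it skips a leading FF FE byte
// order mark if present, then decodes the remaining bytes as UTF-16LE.
// A trailing odd byte is ignored in this sketch.
func decodeUTF16LE(b []byte) string {
	if len(b) >= 2 && b[0] == 0xFF && b[1] == 0xFE {
		b = b[2:]
	}
	units := make([]uint16, 0, len(b)/2)
	for i := 0; i+1 < len(b); i += 2 {
		units = append(units, binary.LittleEndian.Uint16(b[i:i+2]))
	}
	return string(utf16.Decode(units))
}

func main() {
	// FF FE followed by "2021" in UTF-16LE, matching the start of the
	// timestamp in the error message above.
	raw := []byte{0xFF, 0xFE, '2', 0, '0', 0, '2', 0, '1', 0}
	fmt.Printf("%q\n", decodeUTF16LE(raw))
}
```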
