Filebeat utf-8 encoding doesn’t honor BOM #1349

prehor · 2016-04-07T13:03:00Z

Windows often prepends a UTF-8 BOM to text files, which is legal (see "Can a UTF-8 data stream contain the BOM character (in UTF-8 form)?"). In my case it happens in Exchange Message Tracking logs.

Filebeat should strip the BOM from files with UTF-8 encoding, but it doesn't, so the BOM appears in the message field for the first line of a file:

{
  "@timestamp":"2016-04-07T11:58:36.922Z",
  "beat":{
    "hostname":"XXXXX",
    "name":"XXXXX"},
  "count":1,
  "fields":null,
  "input_type":"log",
  "message":"<U+FEFF>#Software: Microsoft Exchange Server",
  "offset":0,
  "source":"exchange/MSGTRK20160405-1.LOG",
  "type":"exchange"
}

Filebeat config:

filebeat:
  prospectors:
    -
      document_type: exchange
      input_type: log
      paths:
        - exchange/MSGTRK2*.LOG
      encoding: utf-8

output:
  file:
    path: logstash/output
    name: exchange

Hex dump of first line in file MSGTRK20160405-1.LOG:

00000000  ef bb bf 23 53 6f 66 74  77 61 72 65 3a 20 4d 69  |...#Software: Mi|
00000010  63 72 6f 73 6f 66 74 20  45 78 63 68 61 6e 67 65  |crosoft Exchange|
00000020  20 53 65 72 76 65 72 0d  0a                       | Server..|

The first three bytes EF BB BF (the UTF-8 encoded BOM) are decoded to the Unicode character U+FEFF, which then appears in the message field.
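To illustrate the decoding described above, here is a minimal Go sketch (not Filebeat code; `firstRune` is a hypothetical helper) showing that the three bytes EF BB BF decode to the single code point U+FEFF:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstRune (hypothetical helper) decodes the first UTF-8 code point in b
// and reports how many bytes it occupied.
func firstRune(b []byte) (rune, int) {
	return utf8.DecodeRune(b)
}

func main() {
	// Same leading bytes as in the hex dump of the Exchange log.
	line := []byte("\xef\xbb\xbf#Software: Microsoft Exchange Server")
	r, size := firstRune(line)
	fmt.Printf("%U (%d bytes)\n", r, size) // U+FEFF (3 bytes)
}
```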

I'm using filebeat-1.2.0-darwin.


StianOvrevage · 2016-08-18T13:35:47Z

Same error with Winlogbeat:

C:\communitor\winlogbeat>winlogbeat.exe -configtest
Loading config file error: YAML config parsing failed on C:\winlogbeat\winlogbeat.yml: yaml: invalid leading UTF-8 octet. Exiting.

This, ironically, makes you unable to edit the Windows log beat ON Windows.

ruflin · 2016-08-19T06:30:42Z

@prehor Sorry for the really late answer. Thanks a lot for providing all the details. The above implementation was kind of blocked by an update to our encoding library (#2089). This is now happening and I hope to be able to have a look at it soonish.

ruflin · 2016-08-19T06:32:05Z

@StianOvrevage Not sure if the two errors are related. The above problem is about the content of the log files. Your issue is related to the config files. Which editor did you use to modify the config file?

StianOvrevage · 2016-08-19T22:15:40Z

I'm using the default (Notepad) editor which every Windows admin uses ;) I used Notepad++ to remove the BOM but it's not an ideal solution.

StianOvrevage · 2016-08-19T22:18:06Z

But yes, I didn't realize that this issue was about reading the actual log files, I just assumed it was the same error I hit upon. But the problem is largely the same. A leading BOM will prevent Winlogbeat from starting.

ruflin · 2016-08-22T10:21:50Z

@StianOvrevage Thanks for bringing up this issue. As these two are not directly related, could you open a separate GitHub issue for further discussion?

ruflin · 2016-08-22T10:57:12Z

@prehor Could you provide an example log file (only a few lines) for download, just to make sure we test exactly the same file? The behavior described above is a bug and we will try to find a way to fix it. Unfortunately it complicates our reading logic quite a bit.

prehor · 2016-08-22T22:41:13Z

@ruflin Here is the beginning of an Exchange Message Tracking file with a UTF-8 BOM.
MSGTRK-SAMPLE.LOG.gz

ruflin · 2016-08-23T15:11:32Z

@prehor Based on your log file I created a PR here for discussion: #2351

Reading a file with a bom included the bom with the first event. This change removes the bom part from the first event in case it exists. * Tests for utf-8 and utf-16 added Closes elastic#1349


dannygoulder · 2018-06-10T21:08:17Z

Hi, I think I am getting the same issue when trying to parse JSON log files with Filebeat 6.2.4.

  "json": {
      "error": {
        "message": "Error decoding JSON: invalid character 'ï' looking for beginning of value",
        "type": "json"
      }
  }

Is this the same issue?

dannygoulder · 2018-06-10T21:10:50Z

The prospector configuration allows me to select utf-16be-bom or utf-8, but not utf-8-bom.

This only occurs with the first line in each log file; the rest of the lines are parsed correctly.

ph · 2018-06-11T17:38:26Z

@dannygoulder Looks oddly similar. Looking at the code and test cases, we should strip the BOM.
By default we assume UTF-8, so there is no need for a special encoding.

beats/filebeat/input/log/harvester.go

Lines 260 to 264 in 1789ef9

    
    // Strip UTF-8 BOM if beginning of file
    // As all BOMS are converted to UTF-8 it is enough to only remove this one
    if h.state.Offset == 0 {
        message.Content = bytes.Trim(message.Content, "\xef\xbb\xbf")
    }
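As a side note on the snippet above: `bytes.Trim` treats its second argument as a set of code points and trims them from both ends, so it would also drop a U+FEFF at the end of the content. A minimal sketch of a prefix-only variant follows; `stripBOM` and `utf8BOM` are hypothetical names, and this is not what the harvester does.

```go
package main

import (
	"bytes"
	"fmt"
)

// utf8BOM is the UTF-8 encoding of U+FEFF.
var utf8BOM = []byte("\xef\xbb\xbf")

// stripBOM (hypothetical helper) removes a single leading UTF-8 BOM, if any,
// and leaves the rest of the content untouched.
func stripBOM(content []byte) []byte {
	return bytes.TrimPrefix(content, utf8BOM)
}

func main() {
	in := []byte("\xef\xbb\xbf#Software: Microsoft Exchange Server")
	fmt.Printf("%q\n", stripBOM(in)) // "#Software: Microsoft Exchange Server"
}
```

Whether the prefix-only behavior matters in practice depends on whether a trailing U+FEFF can occur in real log lines, which this thread does not establish.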

I would check with a hex editor to see which character is at the beginning, and I would also check the output of file -i myfile (or, on macOS, file -I myfile).

This is a hexdump of a UTF-8 file with a BOM.

00000000: efbb bf68 656c 6c6f 2077 6f72 6c64 0a    ...hello world.

output of the file command

ok.md: text/plain; charset=utf-8

dannygoulder · 2018-06-11T18:29:51Z

Hi, thanks for the reply. I thought for a moment that I was getting a strange-looking BOM, but then I realised that hexdump was swapping the bytes with the default -x display (is this expected?):

$ hexdump -n 4 $FILE
0000000 bbef 7bbf
0000004
$ hexdump -C -n 4 $FILE
00000000  ef bb bf 7b                                       |...{|
00000004
$ file $FILE
myfile.log: UTF-8 Unicode (with BOM) text, with CRLF line terminators
$ file -i $FILE
myfile.log: text/plain; charset=utf-8

Any idea what I can do next?

dannygoulder · 2018-06-11T18:31:11Z

Oh, and in case it matters, the file is coming from a Windows 2016 system, upon which filebeat is running. Obviously the hexdump and file commands are being run against the file after copying it to a Linux system. :)

ph · 2018-06-11T19:21:19Z

For the hexdump, I presume we don't have the same defaults on OS X and Linux.

0000000    bbef    68bf                                                
0000004
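The swapped bytes are consistent with hexdump's default display, which groups the input into 16-bit words and prints each word in host byte order (little-endian on x86). A small Go sketch, with `wordsLE` as a hypothetical helper, reproduces both outputs seen in this thread:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// wordsLE (hypothetical helper) formats b as space-separated little-endian
// 16-bit words. A trailing odd byte is ignored in this sketch.
func wordsLE(b []byte) string {
	out := ""
	for i := 0; i+1 < len(b); i += 2 {
		if i > 0 {
			out += " "
		}
		out += fmt.Sprintf("%04x", binary.LittleEndian.Uint16(b[i:]))
	}
	return out
}

func main() {
	fmt.Println(wordsLE([]byte{0xef, 0xbb, 0xbf, 0x7b})) // bbef 7bbf
	fmt.Println(wordsLE([]byte{0xef, 0xbb, 0xbf, 0x68})) // bbef 68bf
}
```

So the file contents are the same BOM in both cases; only the display format differs from hexdump -C.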

I am not sure what is going on here.

@dannygoulder Let's create a new issue and make sure we include the Filebeat version. Could you also create a small file that we can use to reproduce it?

crossan007 · 2019-08-14T20:25:37Z

I'm also seeing issues on Windows Server 2019 where Filebeat fails to parse the first line of a JSON log file when the file format (as indicated by Notepad++) is utf-8-bom.

Using Filebeat 6.4.3, this is logged:

2019-08-14T16:18:44.708-0400	ERROR	json/json.go:51	Error decoding JSON: invalid character 'ï' looking for beginning of value
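The 'ï' in that message is the first BOM byte (0xEF) as seen by Go's JSON decoder. A minimal sketch, assuming the line is available as raw bytes, of how a leading BOM breaks `json.Unmarshal` and how stripping it first avoids the error; `decodeLine` is a hypothetical helper and not Filebeat's actual code path:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// decodeLine (hypothetical helper) strips one leading UTF-8 BOM and then
// unmarshals the line as a JSON object.
func decodeLine(line []byte) (map[string]interface{}, error) {
	line = bytes.TrimPrefix(line, []byte("\xef\xbb\xbf"))
	var v map[string]interface{}
	err := json.Unmarshal(line, &v)
	return v, err
}

func main() {
	raw := []byte("\xef\xbb\xbf{\"level\":\"info\"}")

	// Decoding the raw bytes fails because 0xEF cannot start a JSON value.
	var v map[string]interface{}
	fmt.Println(json.Unmarshal(raw, &v))

	// With the BOM removed, the same line decodes cleanly.
	fields, err := decodeLine(raw)
	fmt.Println(fields, err)
}
```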

acamro · 2021-05-09T02:16:38Z

Please also check for "UCS2 LE BOM" (as Notepad++ calls it), i.e. the little-endian BOM. The first bytes are FF FE in my hex editor. This was found in SQL Server 2019 logs.

Error in filebeat:
��2\u00000\u00002\u00001\u0000-\u00000\u00003\u0000-\u00000\u00001\u0000 \u00001\u00000\u0000:\u00004\u00009\u0000:\u00001\u00006\u0000.\u00008\u00006\u0000' could not be parsed at index 0

Thanks in advance
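For context on the error above: the \u0000 after every character is what UTF-16LE text looks like when it is read byte-by-byte as UTF-8, and FF FE is the UTF-16 little-endian BOM. The following Go sketch only illustrates the decoding (it is not how Filebeat handles encodings; `decodeUTF16LE` is a hypothetical helper):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unicode/utf16"
)

// decodeUTF16LE (hypothetical helper) converts UTF-16LE bytes to a Go string,
// dropping a leading FF FE byte order mark if there is one.
func decodeUTF16LE(b []byte) string {
	if len(b) >= 2 && b[0] == 0xff && b[1] == 0xfe {
		b = b[2:]
	}
	units := make([]uint16, 0, len(b)/2)
	for i := 0; i+1 < len(b); i += 2 {
		units = append(units, binary.LittleEndian.Uint16(b[i:]))
	}
	return string(utf16.Decode(units))
}

func main() {
	// FF FE followed by "2021" in UTF-16LE, like the start of the failing line.
	raw := []byte{0xff, 0xfe, '2', 0, '0', 0, '2', 0, '1', 0}
	fmt.Printf("%q\n", decodeUTF16LE(raw)) // "2021"
}
```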

andrewkroh added the Filebeat and bug labels Apr 7, 2016

ruflin mentioned this issue Aug 23, 2016

Strip bom from message #2351

Merged

StianOvrevage mentioned this issue Aug 23, 2016

Winlogbeat not parsing config files with leading BOM #2354

Closed

tsg closed this as completed in #2351 Sep 2, 2016

acamro mentioned this issue May 9, 2021

[Filebeat] Additional utfbom fixes #25624

Closed
