Better auto columnizer #265

achimmihca · 2022-09-15T12:25:22Z

The "auto columnizer" was not helpful in my case. It just falls back to "default (single line)".

Instead, I suggest that the auto columnizer try different strategies for splitting the first N (e.g. 100) lines into columns.
Each strategy could then be rated by checking how many lines end up with the same number of columns. The more lines share the same column count, the better.
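
The rating idea above can be sketched as follows. This is only an illustration: the names `consistency_score` and `rank_strategies` are made up here, and a "strategy" is assumed to be any function that turns a line into a list of column strings.

```python
from collections import Counter
from typing import Callable

# Assumption: a strategy is a function from one line to its list of columns.
Splitter = Callable[[str], list[str]]


def consistency_score(split: Splitter, lines: list[str]) -> float:
    """Fraction of lines whose column count equals the most common count."""
    if not lines:
        return 0.0
    counts = Counter(len(split(line)) for line in lines)
    return counts.most_common(1)[0][1] / len(lines)


def rank_strategies(
    strategies: dict[str, Splitter], lines: list[str], sample_size: int = 100
) -> list[tuple[str, float]]:
    """Score every strategy on the first sample_size lines, best first."""
    sample = lines[:sample_size]
    scored = [(name, consistency_score(split, sample)) for name, split in strategies.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

One caveat with this rating: a strategy that puts every line into a single column always scores 1.0, so a real implementation would probably also need to take the number of columns into account.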

It can then show the user the best 3 strategies and how they performed (or just show all strategies and how they performed).
Out of these, the user can select the strategy to use.
Note that this requires some sort of auto columnizer dialog ("wizard").

Possible strategies:

- Try the existing columnizers
- Maybe machine learning could be used to columnize log files?
- Try to detect columns as "connected blocks"
  - Replace all non-whitespace characters (regex [^\s]) with x
    - This results in connected blocks, e.g. xxxxxxxxxxx xxxxxxxxxxxx xx xxx xxxxxxx xxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  - Construct prefixes of the masked lines. As the sample line, use the prefix that is as long as possible while still being a prefix of as many lines as possible
    - Prefixes would be x, xx, xxx etc.
    - To rate prefixes, one could multiply the length of the prefix by the number of lines that start with it. The prefix with the highest rating is the sample line.
  - In the sample line, search for connected blocks (regex [^\s]+\s|\s[^\s]+)
    - These connected blocks are the columns, plus one final column for the rest of the line (the part that is not in the prefix).
      - One columnizer strategy could be to split the connected blocks based on character position
        - E.g. split on position 12, 25, 28, ...
      - Another columnizer strategy could be to count the blocks and match each block in the form (\s*[^\s]+\s*)
        - E.g. columnizer regex for 3 blocks: (\s*[^\s]+\s*)(\s*[^\s]+\s*)(\s*[^\s]+\s*)(.*)
        - BTW: This could also be an easily configurable columnizer. Just let the user define the number of columns / blocks.
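
A rough sketch of the "connected blocks" steps, under some assumptions: the function names are invented for illustration, the sample log lines are made up, and the prefix search is a brute-force count over all prefixes (fine for around 100 lines, not tuned for more).

```python
import re
from collections import Counter


def mask(line: str) -> str:
    """Replace every non-whitespace character with 'x'."""
    return re.sub(r"[^\s]", "x", line)


def best_prefix(masked_lines: list[str]) -> str:
    """Pick the prefix with the highest rating (length * lines sharing it)."""
    counts = Counter(
        line[:end] for line in masked_lines for end in range(1, len(line) + 1)
    )
    return max(counts, key=lambda prefix: len(prefix) * counts[prefix], default="")


def block_boundaries(sample: str) -> list[int]:
    """End offsets of the connected blocks found in the sample line."""
    return [m.end() for m in re.finditer(r"[^\s]+\s|\s[^\s]+", sample)]


def split_on_positions(line: str, boundaries: list[int]) -> list[str]:
    """Split at the given offsets; the last column is the rest of the line."""
    starts = [0] + boundaries
    ends = boundaries + [len(line)]
    return [line[start:end].strip() for start, end in zip(starts, ends)]


# Made-up sample lines, only for demonstration.
lines = [
    "2022-09-15 12:25:22 INFO Started service",
    "2022-09-15 12:25:23 WARN Disk almost full",
    "2022-09-15 12:25:24 INFO Done",
]
sample = best_prefix([mask(line) for line in lines])  # "xxxxxxxxxx xxxxxxxx xxxx xxxx"
boundaries = block_boundaries(sample)                  # [11, 20, 25]
print(split_on_positions(lines[0], boundaries))
# ['2022-09-15', '12:25:22', 'INFO', 'Started service']
```

Note that the trailing xxxx of the sample line is not reported as a block, because the regex only accepts blocks that have whitespace next to them. That part of each line ends up in the final "rest" column.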

achimmihca · 2022-09-15T12:31:34Z

Using a sample line with connected blocks would also make for a nice semi-automatic columnizer wizard.

- The user picks a sample line.
- The user marks where the relevant sample blocks start and end (split line on position).
- Or the user specifies the number of blocks (split the first N connected blocks, plus a final column for the rest of the line).

This would be much easier, faster, and more visual than manually constructing RegEx for columnizers.

EDIT:
Actually, it is already quite simple to create an "N blocks plus rest" columnizer. This is a special case of the existing "RegEx Columnizer".
However, I think it would be worth having a dedicated "N blocks plus rest" columnizer for simpler configuration.
It uses the RegEx (\s*[^\s]+\s*) N times and ends with (.*).
Configuration would be: (a) number of blocks, (b) separator character between blocks (in my example I used whitespace, \s, but it could also be a semicolon etc.)
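
As a sketch of how little configuration this needs: the helper names `n_blocks_pattern` and `columnize` below are hypothetical, and the separator is assumed to be a single character or class escape (such as \s or ;) that is valid inside a regex character class.

```python
import re
from typing import Optional


def n_blocks_pattern(block_count: int, separator: str = r"\s") -> re.Pattern:
    """Build the block regex repeated block_count times, followed by (.*)."""
    # Assumption: separator is one character or a class escape like \s.
    block = rf"({separator}*[^{separator}]+{separator}*)"
    return re.compile(block * block_count + r"(.*)")


def columnize(line: str, pattern: re.Pattern, strip_chars: Optional[str] = None) -> list[str]:
    """Return the columns, or the whole line as one column if nothing matches."""
    match = pattern.match(line)
    if match is None:
        return [line]
    # strip_chars=None strips whitespace; pass e.g. ";" for semicolon-separated lines.
    return [group.strip(strip_chars) for group in match.groups()]


print(columnize("2022-09-15 12:25:22 INFO Started service", n_blocks_pattern(3)))
# ['2022-09-15', '12:25:22', 'INFO', 'Started service']
print(columnize("a;b;c;d", n_blocks_pattern(2, ";"), strip_chars=";"))
# ['a', 'b', 'c;d']
```

With a non-whitespace separator this simple form collapses consecutive separators, so empty fields would be lost. Whether that is acceptable depends on the log format.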

A better RegEx for a single block might be "at least one whitespace character at the start, or at least one at the end": (\s+[^\s]+|[^\s]+\s+)
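
A small comparison of the two block patterns, as a sketch (the constant and function names are made up). Because every \s* in the first pattern may match nothing, regex backtracking can split a single word into several "blocks" when a line has fewer blocks than requested. The stricter pattern refuses to match in that case.

```python
import re

LOOSE_BLOCK = r"(\s*[^\s]+\s*)"          # pattern from the original proposal
STRICT_BLOCK = r"(\s+[^\s]+|[^\s]+\s+)"  # whitespace required on one side


def match_blocks(block: str, block_count: int, line: str):
    """Return the captured groups, or None if the line does not match."""
    match = re.match(block * block_count + r"(.*)", line)
    return None if match is None else match.groups()


print(match_blocks(LOOSE_BLOCK, 2, "hello"))               # ('hell', 'o', '')
print(match_blocks(STRICT_BLOCK, 2, "hello"))              # None
print(match_blocks(STRICT_BLOCK, 2, "hello world again"))  # ('hello ', 'world ', 'again')
```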
