Better auto columnizer #265

Open
achimmihca opened this issue Sep 15, 2022 · 1 comment

achimmihca commented Sep 15, 2022

The "auto columnizer" was not helpful in my case. It just falls back to "default (single line)".

Instead, I suggest that the auto columnizer try several strategies for splitting the first N (e.g. 100) lines into columns.
The quality of each tested strategy could then be rated by checking how many lines end up with the same number of columns: the more lines that share a column count, the better.

It can then show the user the best 3 strategies and how they performed (or simply all strategies and how they performed).
Out of these, the user can select the strategy to be used.
Note that this requires some sort of auto columnizer dialog ("wizard").
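A rough sketch of the rating idea, not tied to the existing columnizer code. It assumes a strategy is simply a function that turns a line into a list of columns; the names `rate_strategy` and `rank_strategies` are made up for illustration. Ties on the score are broken in favour of more columns, which is an extra assumption.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Hypothetical type: a strategy is any function that splits a line into columns.
Splitter = Callable[[str], List[str]]


def rate_strategy(split: Splitter, lines: List[str]) -> Tuple[float, int]:
    """Return (share of lines that have the most common column count, that count)."""
    if not lines:
        return 0.0, 0
    counts = Counter(len(split(line)) for line in lines)
    columns, hits = counts.most_common(1)[0]
    return hits / len(lines), columns


def rank_strategies(
    strategies: Dict[str, Splitter],
    lines: List[str],
    sample_size: int = 100,
    top: int = 3,
) -> List[Tuple[str, float, int]]:
    """Rate every strategy on the first `sample_size` lines and keep the best `top`."""
    sample = lines[:sample_size]
    rated = [(name, *rate_strategy(split, sample)) for name, split in strategies.items()]
    # Assumption: on equal score, prefer the strategy that yields more columns.
    rated.sort(key=lambda item: (item[1], item[2]), reverse=True)
    return rated[:top]


example_lines = [
    "2022-09-15 10:00:01 INFO Started",
    "2022-09-15 10:00:02 WARN Disk low",
    "2022-09-15 10:00:03 INFO Done",
]
example_strategies = {
    "single line": lambda line: [line],
    "whitespace": lambda line: line.split(),
    "first 3 blocks + rest": lambda line: line.split(None, 3),
}
ranked = rank_strategies(example_strategies, example_lines)
```

One caveat with the plain metric: a strategy that never splits anything always scores 1.0, because every line has exactly one column. The tie-break above is one way to keep "single line" from winning by default.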

Possible strategies:

  • Try existing columnizers
  • Maybe machine learning could be used to columnize log files?
  • Try to detect columns as "connected blocks"
    • replace all non-white-space characters (regex [^\s]) with x
      • This results in connected blocks, e.g. xxxxxxxxxxx xxxxxxxxxxxx xx xxx xxxxxxx xxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    • Construct prefixes of the resulting lines. Use as the sample line the prefix that is as long as possible while still being shared by as many lines as possible
      • Prefixes would be x, xx, xxx etc.
      • To rate prefixes, one could multiply the length of the prefix by the number of lines that start with this prefix. The prefix with the highest rating is the sample line.
    • In the sample line, search for connected blocks (regex [^\s]+\s|\s[^\s]+)
      • these connected blocks are the columns, plus one final column for the rest of the line (the part that is not in the prefix).
        • One columnizer strategy could be to split the connected blocks based on character position
          • E.g. split on position 12, 25, 28, ...
        • Another columnizer strategy could be to count the blocks and match each block in the form (\s*[^\s]+\s*)
          • E.g. columnizer regex for 3 blocks: (\s*[^\s]+\s*)(\s*[^\s]+\s*)(\s*[^\s]+\s*)(.*)
          • BTW: This could also be an easily configurable columnizer. Just let the user define the number of columns / blocks.
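The "connected blocks" steps above could be sketched roughly as follows. All function names are hypothetical. One simplification to be aware of: the split positions are the start of each block in the sample line, and everything from the last block start onward is treated as the final "rest" column, which is only one possible reading of the idea.

```python
import re
from collections import Counter
from typing import List


def mask(line: str) -> str:
    """Replace every non-whitespace character with 'x'."""
    return re.sub(r"[^\s]", "x", line)


def pick_sample_prefix(lines: List[str]) -> str:
    """Pick the masked prefix with the highest (length * number of lines) rating."""
    ratings: Counter = Counter()
    for masked in (mask(line) for line in lines):
        for end in range(1, len(masked) + 1):
            # Adding the length once per line gives length * line count in total.
            ratings[masked[:end]] += end
    if not ratings:
        return ""
    # Assumption: on equal rating, prefer the longer prefix.
    return max(ratings.items(), key=lambda item: (item[1], len(item[0])))[0]


def split_positions(sample: str) -> List[int]:
    """Start position of every connected block in the sample, except the first."""
    return [match.start() for match in re.finditer(r"[^\s]+", sample)][1:]


def split_at_positions(line: str, positions: List[int]) -> List[str]:
    """Cut the line at the given character positions; the last column takes the rest."""
    bounds = [0, *positions, len(line)]
    return [line[start:end].strip() for start, end in zip(bounds, bounds[1:])]


log_lines = [
    "2022-09-15 10:00:01 INFO Started",
    "2022-09-15 10:00:02 WARN Disk low",
    "2022-09-15 10:00:03 INFO Done",
]
sample = pick_sample_prefix(log_lines)   # masked prefix shared by all three lines
positions = split_positions(sample)      # [11, 20, 25]
columns = split_at_positions(log_lines[1], positions)
# columns == ["2022-09-15", "10:00:02", "WARN", "Disk low"]
```

Rating every prefix of every line is quadratic in the line length, which should be acceptable for a sample of about 100 lines but would need rethinking for very long lines.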

achimmihca commented Sep 15, 2022

Using a sample line with connected blocks would also make for a nice semi-automatic columnizer wizard.

  • The user can pick a sample line.
  • The user can select where the relevant sample blocks start and end (split line on position)
  • Or the user could specify the number of blocks (split first N connected blocks, plus a final column for the rest of the line)

This would be much easier, faster, and more visual than manually constructing RegEx for columnizers.

EDIT:
Actually, it is already quite simple to create an "N blocks plus rest" columnizer. It is a special case of the existing "RegEx Columnizer".
However, I think it would be worth having a dedicated "N blocks plus rest" columnizer for simpler configuration.
It uses the RegEx (\s*[^\s]+\s*) N times and ends with (.*).
Configuration would be: (a) number of blocks, (b) separator character between blocks (in my example I used space \s but it could also be semicolon etc.)
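To illustrate, here is a minimal sketch of how such a columnizer could build its RegEx from the two settings. The function names and the way the separator is passed in (`None` for whitespace, otherwise a single literal character) are assumptions for this example, not an existing API.

```python
import re
from typing import List, Optional


def build_block_regex(blocks: int, separator: Optional[str] = None) -> "re.Pattern[str]":
    """Build an "N blocks plus rest" pattern.

    separator=None means whitespace (\\s); otherwise a single literal character.
    """
    if blocks < 1:
        raise ValueError("blocks must be at least 1")
    if separator is None:
        sep = r"\s"
    elif len(separator) == 1:
        sep = re.escape(separator)
    else:
        raise ValueError("separator must be a single character")
    block = f"({sep}*[^{sep}]+{sep}*)"
    return re.compile(block * blocks + "(.*)")


def columnize(line: str, pattern: "re.Pattern[str]") -> List[str]:
    """Return the captured columns, or the whole line as one column if nothing matches."""
    match = pattern.match(line)
    return list(match.groups()) if match else [line]


three_blocks = build_block_regex(3)
# three_blocks.pattern == r"(\s*[^\s]+\s*)(\s*[^\s]+\s*)(\s*[^\s]+\s*)(.*)"
columns = columnize("2022-09-15 10:00:02 WARN Disk low", three_blocks)
# columns == ["2022-09-15 ", "10:00:02 ", "WARN ", "Disk low"]
```

The captured columns still contain their separators; a real columnizer would probably trim them before display.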

A better RegEx for a single block might be "at least one space at start, or at least one space at end" (\s+[^\s]+|[^\s]+\s+)
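A quick way to see the difference between the two block patterns, as a sketch using Python's `re` module (the constant and helper names are made up). With the first pattern, a line that has fewer blocks than requested still matches, because backtracking cuts a block in the middle. The second pattern refuses such a line.

```python
import re
from typing import List, Optional

LOOSE_BLOCK = r"(\s*[^\s]+\s*)"           # block pattern from the earlier comment
STRICT_BLOCK = r"(\s+[^\s]+|[^\s]+\s+)"   # needs whitespace on at least one side


def match_blocks(block: str, count: int, line: str) -> Optional[List[str]]:
    """Match `count` blocks plus a rest column; return None if the line does not fit."""
    match = re.match(block * count + r"(.*)", line)
    return list(match.groups()) if match else None


# One block only, but two requested: the loose pattern splits inside the block.
loose = match_blocks(LOOSE_BLOCK, 2, "abcdef")      # ["abcde", "f", ""]
strict = match_blocks(STRICT_BLOCK, 2, "abcdef")    # None

# Enough blocks: the strict pattern behaves as expected.
ok = match_blocks(STRICT_BLOCK, 2, "abc def ghi")   # ["abc ", "def ", "ghi"]
```

Note that the stricter pattern also rejects "abc def" when two blocks are requested, because the last block has no whitespace of its own left to claim. Whether that is desirable probably depends on how the rest column is meant to be used.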
