Jilapi

Jilapi is a Java library for parsing unstructured data. Given unstructured input, it produces structured output. Jilapi was originally written to parse Linux command-line output, but it has evolved to handle other kinds of unstructured data.

Jilapi terminologies

  • A single unit of useful data is called an "Entity".
  • An "Entity" consists of multiple fields.

How it works

  • Jilapi categorizes unstructured data into 3 types:
    • Tabular data - The data resembles a table. Each line/row in the table represents a meaningful entity.
    • Chunked data - The data is like a blob. A meaningful entity may span multiple lines.
    • Hierarchical data - The data represents a hierarchy spanning multiple lines. Each entity may be present on a single line, but it is related to entities on other lines.
  • The Jilapi library accepts the following types of input:
    • InputStream
    • A single String with a new line delimiter to mark individual lines.
  • The result of parsing the command output is returned either as a Java object or in JSON format.
  • Jilapi is packaged as a standalone jar.
  • The user of the library is required to provide a Properties reference that holds the command parsing rules. The Properties reference can be instantiated however the user sees fit, i.e. sourced from a file, a DB etc. Please check the sample jilapi.properties file for reference.
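
As a rough illustration of the last point, the sketch below loads parsing rules into a java.util.Properties object from any Reader (a file, a DB column, an in-memory string). The class and method names (RulesLoaderSketch, loadRules) are hypothetical and not part of Jilapi, and how the loaded Properties is then handed to the library is not shown here.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of Jilapi: builds the Properties reference.
class RulesLoaderSketch {

    // Loads parsing rules from any Reader; wraps IOException so callers stay simple.
    static Properties loadRules(Reader source) {
        Properties rules = new Properties();
        try {
            rules.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rules;
    }

    public static void main(String[] args) {
        // In-memory rules for illustration; a FileReader would work the same way.
        String text = "cmnd4.parser.type=tabular\n"
                + "cmnd4.result.entity.field.delimiter=:\n";
        Properties rules = loadRules(new StringReader(text));
        System.out.println(rules.getProperty("cmnd4.parser.type"));
        System.out.println(rules.getProperty("cmnd4.result.entity.field.delimiter"));
    }
}
```

Only the first = or : on a line is treated as the key/value separator by Properties.load, so a delimiter value such as : survives loading unchanged.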

Build

  • git clone https://github.com/BinitaBharati/jilapi.git
  • cd jilapi
  • mvn clean;mvn package - generates the jilapi uber jar.
  • The generated jilapi jar can be included in other projects.

Test

  • mvn test

Jilapi property

This is the heart of Jilapi. For every command, the user of this library has to pass a Properties object loaded with the command parsing rules. Below is a summary of all the possible properties:

  1. <CMND_KEY>.parser.type: The command parser type. Currently, 3 parser types are supported:
    • TabularParser - handles tabular data. This reads data line-wise.
    • ChunkedParser - handles data in which an entity spans multiple lines, i.e. entity data is available in chunks. See ifconfig -a.
    • NestedParser - handles nested/hierarchical data. This reads data line-wise. See nested output.
  2. <CMND_KEY>.entity.end: The delimiter marking the end of a complete entity. An entity is the smallest unit of useful data. A command output can have multiple entity instances. TabularParser and NestedParser currently support a new line as the default entity delimiter, so this field need not be specified when the parser is tabular/nested. The chunked parser works with blocks of meaningful data, so the entity delimiter need not be a new line and has to be specified explicitly.
  3. <CMND_KEY>.result.entity.field.delimiter: The delimiter that separates the individual fields of an entity. This should be a unique character across the whole of the unstructured data. The default field delimiter, when not specified, is SPACE.
  4. <CMND_KEY>.result.sections: Useful when the command output has multiple sections. May not be applicable for all commands. If a command output has multiple sections, they are demarcated using a semicolon character. Please check the 'cmnd3' properties for a demo of the <CMND_KEY>.result.sections property.
  5. <CMND_KEY>.result.header: The output line preceding the start of the actual data. May not be applicable for all commands.
  6. <CMND_KEY>.result.footer: The output line following the end of the actual data. May not be applicable for all commands.
  7. <CMND_KEY>.result.ignore: The output line that needs to be ignored. May not be applicable for all commands.
  8. <CMND_KEY>.result.entity.field.positional.map: Applicable only for tabular data. It is a map representing the positions of the fields of an entity. The map should list the field positions in ascending order. E.g. 1:fieldA,4:fieldB,10:fieldC is valid, but 1:fieldA,10:fieldC,4:fieldB is invalid. Also, a single field can span multiple positions (columns) in the output line. See cmnd1's buildTime for a sample.
  9. <CMND_KEY>.result.stop: If present, indicates where to stop parsing the given command output. Do not confuse this with footer: with footer, parsing continues till EOF, but with stop, parsing stops completely.
  10. <CMND_KEY>.result.entity.field.parser: Implementation of com.github.binitabharati.jilapi.entity.parser.EntityParser. Applicable when the command parser type is chunked/nested.
  11. <CMND_KEY>.nested.hierarchy.id: Implementation of com.github.binitabharati.jilapi.parser.worker.NestedHierarchyIdentifier. Applicable when the command parser type is nested. This property provides a way to identify each element in a nested hierarchy.
  12. <CMND_KEY>.nested.hierarchy: A String representing the nested hierarchy. This entry has the following rules:
    • Each independent hierarchy is demarcated with a semicolon. In this context, an independent hierarchy is one that is wholly unrelated to any other hierarchy entry.
    • Each element name within the hierarchy has to be prefixed and suffixed with a % character.
    • Every child hierarchy has to be enclosed within [ and ].
    • A parent specifies its children by preceding them with a ->.
    • An example entry would be:
      %A2%->[%A4%->[%A8%->[%A13%,%A14%],%A9%],%A5%];%A3%->[%A6%,%A7%];%A10%->[%A11%->[%A12%]]
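
To make the hierarchy rules concrete, here is a minimal recursive-descent sketch that turns such a string into a parent-to-children map. It is an illustration of the syntax only, not Jilapi's own parser; the class name HierarchySketch and its methods are hypothetical, and the sketch assumes there is no whitespace between tokens.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reader for the hierarchy syntax: %name%->[child,child];%other%
class HierarchySketch {
    private final String text;
    private int pos = 0;
    private final Map<String, List<String>> children = new LinkedHashMap<>();

    private HierarchySketch(String text) {
        this.text = text;
    }

    // Returns every element name mapped to the names of its direct children.
    static Map<String, List<String>> parse(String hierarchy) {
        HierarchySketch parser = new HierarchySketch(hierarchy.trim());
        parser.parseNode();
        while (parser.pos < parser.text.length()) {
            parser.expect(";"); // a semicolon starts the next independent hierarchy
            parser.parseNode();
        }
        return parser.children;
    }

    // node := '%' name '%' [ '->' '[' node { ',' node } ']' ]
    private String parseNode() {
        expect("%");
        int end = text.indexOf('%', pos);
        if (end < 0) {
            throw new IllegalArgumentException("Unterminated element name at index " + pos);
        }
        String name = text.substring(pos, end);
        pos = end + 1;
        List<String> kids = new ArrayList<>();
        children.put(name, kids);
        if (text.startsWith("->", pos)) {
            pos += 2;
            expect("[");
            kids.add(parseNode());
            while (text.startsWith(",", pos)) {
                pos++;
                kids.add(parseNode());
            }
            expect("]");
        }
        return name;
    }

    private void expect(String token) {
        if (!text.startsWith(token, pos)) {
            throw new IllegalArgumentException("Expected '" + token + "' at index " + pos);
        }
        pos += token.length();
    }

    public static void main(String[] args) {
        String example = "%A2%->[%A4%->[%A8%->[%A13%,%A14%],%A9%],%A5%];"
                + "%A3%->[%A6%,%A7%];%A10%->[%A11%->[%A12%]]";
        parse(example).forEach((parent, kids) -> System.out.println(parent + " -> " + kids));
    }
}
```

For the example entry, A2 ends up with the children A4 and A5, A4 with A8 and A9, and leaf elements such as A13 map to an empty list.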

Quick Start

Let's look at a few sample commands.

uname -a

Executing uname -a on a Linux system generates the following output:
[Image: output of uname -a]

Now, let's understand which attributes of the Jilapi property file matter in this case.

  • <CMND_KEY>.parser.type: The data can be visualized as tabular data with only a single row and multiple columns.
  • <CMND_KEY>.entity.end: Here a single line contains a complete meaningful entity. Hence, the entity delimiter is a new line, which is also the default entity delimiter, and this attribute doesn't apply.
  • <CMND_KEY>.result.entity.field.delimiter: The entity field delimiter is SPACE here, which is also the default. So, this attribute doesn't apply.
  • <CMND_KEY>.result.sections: The output is just a single line. Hence, multiple sections don't apply.
  • <CMND_KEY>.result.header: The output is just a single line. Hence, headers don't apply.
  • <CMND_KEY>.result.footer: The output is just a single line. Hence, footers don't apply.
  • <CMND_KEY>.result.ignore: The output is just a single line. Hence, ignore doesn't apply.
  • <CMND_KEY>.result.entity.field.positional.map: Here each field of the entity is positional. E.g. at the 1st position we find the kernel name, at the 2nd position the node name etc. The build time field spans multiple columns, viz. columns 4 to 11.
  • <CMND_KEY>.result.stop: N/A. The output is just a single line. No specific line where parsing should stop.
  • <CMND_KEY>.result.entity.field.parser: N/A as command parser is tabular.
  • <CMND_KEY>.nested.hierarchy.id: N/A as command parser is tabular.
  • <CMND_KEY>.nested.hierarchy: N/A as command parser is tabular.

The corresponding property file entry is given below:

cmnd1.parser.type=tabular
cmnd1.result.entity.field.positional.map=1:kernelName,2:nodeName,3:kernelVersion,4-11:buildTime,12:processorType,13:hwPlatform,14:processorArch,15:osName
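
The following sketch shows one way a positional map with a range such as 4-11 could be applied to a single space-delimited line. It only illustrates the idea and is not Jilapi's implementation; PositionalMapSketch and apply are hypothetical names, and the uname line is a made-up sample.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of "position:field" and "from-to:field" entries.
class PositionalMapSketch {

    // Applies a map such as "1:a,4-11:b" to one line; positions are 1-based.
    static Map<String, String> apply(String positionalMap, String line) {
        String[] columns = line.trim().split("\\s+");
        Map<String, String> entity = new LinkedHashMap<>();
        for (String entry : positionalMap.split(",")) {
            String[] parts = entry.split(":", 2);      // e.g. "4-11" and "buildTime"
            String[] range = parts[0].split("-");
            int first = Integer.parseInt(range[0].trim());
            int last = range.length > 1 ? Integer.parseInt(range[1].trim()) : first;
            StringBuilder value = new StringBuilder();
            // Join every column in the range with a single space.
            for (int p = first; p <= last && p <= columns.length; p++) {
                if (value.length() > 0) {
                    value.append(' ');
                }
                value.append(columns[p - 1]);
            }
            entity.put(parts[1].trim(), value.toString());
        }
        return entity;
    }

    public static void main(String[] args) {
        String map = "1:kernelName,2:nodeName,3:kernelVersion,4-11:buildTime,"
                + "12:processorType,13:hwPlatform,14:processorArch,15:osName";
        // Made-up sample line in the usual uname -a layout.
        String line = "Linux myhost 3.13.0-24-generic #46-Ubuntu SMP Thu Apr 10 "
                + "19:11:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux";
        apply(map, line).forEach((field, value) -> System.out.println(field + " = " + value));
    }
}
```

With the sample line, columns 4 to 11 are joined into one buildTime value, while every other field takes exactly one column.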

route -n

Executing route -n on a Linux system generates the following output:
[Image: output of route -n]

Now, let's understand which attributes of the Jilapi property file matter in this case.

  • <CMND_KEY>.parser.type: The data can be visualized as tabular data.
  • <CMND_KEY>.entity.end: Here each single line contains a complete meaningful entity, which is a route entry. Hence, the entity delimiter is a new line, which is also the default entity delimiter, and this attribute doesn't apply.
  • <CMND_KEY>.result.entity.field.delimiter: The entity field delimiter is SPACE here, which is also the default. So, this attribute doesn't apply.
  • <CMND_KEY>.result.sections: There are no sections, as in there is only a single large section. Multiple sections don't apply.
  • <CMND_KEY>.result.header: The line containing the fields Destination, Gateway, Genmask etc. precedes the actual route entries. So, the column titles Destination, Gateway, Genmask etc. form the header.
  • <CMND_KEY>.result.footer: Not applicable.
  • <CMND_KEY>.result.ignore: Not applicable.
  • <CMND_KEY>.result.entity.field.positional.map: Here each field of the entity is positional. E.g. at the 1st position we find the destination network, at the 2nd position the gateway etc.
  • <CMND_KEY>.result.stop: N/A. No specific line where parsing should stop.
  • <CMND_KEY>.result.entity.field.parser: N/A as command parser is tabular.
  • <CMND_KEY>.nested.hierarchy.id: N/A as command parser is tabular.
  • <CMND_KEY>.nested.hierarchy: N/A as command parser is tabular.

The corresponding property file entry is given below:

cmnd2.parser.type=tabular
cmnd2.result.header=Destination,Gateway,Genmask,Flags,Metric,Ref,Use,Iface
cmnd2.result.entity.field.positional.map=1:destinationNw,2:gateway,3:netMask,5:metric,8:port
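
The role of the header can be sketched as follows: everything up to and including the header line is skipped, and only the rows after it are treated as entities. How Jilapi matches the header internally is not documented here, so this sketch simply assumes that the line's whitespace-separated tokens must equal the comma-separated header tokens. HeaderSkipSketch and its methods are hypothetical, and the routing table is a made-up sample.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of skipping everything up to the header line.
class HeaderSkipSketch {

    // True when the line's tokens equal the comma-separated header tokens (assumed rule).
    static boolean isHeader(String line, String headerProperty) {
        List<String> expected = Arrays.asList(headerProperty.split(","));
        List<String> actual = Arrays.asList(line.trim().split("\\s+"));
        return expected.equals(actual);
    }

    // Returns the non-empty lines that follow the header line.
    static List<String> rowsAfterHeader(String output, String headerProperty) {
        List<String> rows = new ArrayList<>();
        boolean headerSeen = false;
        for (String line : output.split("\\r?\\n")) {
            if (!headerSeen) {
                headerSeen = isHeader(line, headerProperty);
            } else if (!line.trim().isEmpty()) {
                rows.add(line);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String header = "Destination,Gateway,Genmask,Flags,Metric,Ref,Use,Iface";
        // Made-up sample in the usual route -n layout.
        String output = "Kernel IP routing table\n"
                + "Destination     Gateway         Genmask         Flags Metric Ref    Use Iface\n"
                + "0.0.0.0         192.168.1.1     0.0.0.0         UG    0      0        0 eth0\n"
                + "192.168.1.0     0.0.0.0         255.255.255.0   U     1      0        0 eth0\n";
        for (String row : rowsAfterHeader(output, header)) {
            String[] cols = row.trim().split("\\s+");
            // Positions 1, 2, 3, 5 and 8 of the positional map (1-based).
            System.out.println(cols[0] + " via " + cols[1] + " mask " + cols[2]
                    + " metric " + cols[4] + " on " + cols[7]);
        }
    }
}
```

The positional map for this command skips positions 4, 6 and 7, so Flags, Ref and Use never reach the resulting entity.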

route print

Executing route print on a Windows system generates the following output:
[Image: output of route print]

Now, let's understand which attributes of the Jilapi property file matter in this case.

  • <CMND_KEY>.parser.type: The data can be visualized as tabular data.
  • <CMND_KEY>.entity.end: Here, a complete meaningful entity, which is a route entry, can be derived from a single line. Hence, the entity delimiter is a new line, which is also the default entity delimiter, and this attribute doesn't apply.
  • <CMND_KEY>.result.entity.field.delimiter: The entity field delimiter is SPACE here, which is also the default. So, this attribute doesn't apply.
  • <CMND_KEY>.result.sections: There are multiple sections of route entries here, viz. IPv4 route entries and IPv6 route entries.
  • <CMND_KEY>.result.header: Each section contains its own header. E.g. the IPv4 section has the header Network Destination, Netmask, Gateway, Interface and Metric.
  • <CMND_KEY>.result.footer: Each section contains its own footer. Both the IPv4 and IPv6 sections have ==== as the footer.
  • <CMND_KEY>.result.ignore: Not applicable.
  • <CMND_KEY>.result.entity.field.positional.map: Here each field of the entity is positional. E.g. in the IPv4 section, we find the destination network at the 1st position, the netmask at the 2nd position, the gateway at the 3rd position etc.
  • <CMND_KEY>.result.stop: N/A. No specific line where parsing should stop.
  • <CMND_KEY>.result.entity.field.parser: N/A as command parser is tabular.
  • <CMND_KEY>.nested.hierarchy.id: N/A as command parser is tabular.
  • <CMND_KEY>.nested.hierarchy: N/A as command parser is tabular.

The corresponding property file entry is given below:

cmnd3.parser.type=tabular
cmnd3.result.sections=ipv4Route;ipv6Route
cmnd3.result.header=Network Destination,Netmask,Gateway,Interface,Metric;If,Metric,Network Destination,Gateway
cmnd3.result.footer=\=;\=
cmnd3.result.entity.field.positional.map=1:destinationNw,2:netMask,3:gateway,4:port,5:metric;1:field1,2:metric,3:destination,4:gateway
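
In the entry above, every section-aware property carries one semicolon-separated value per section, in the same order as cmnd3.result.sections. A minimal sketch of pairing section names with their values is shown below; SectionRulesSketch and perSection are hypothetical names, and the pairing by index is an assumption drawn from the layout of the entry rather than from Jilapi's code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of pairing section names with per-section values.
class SectionRulesSketch {

    // Splits both properties on ';' and pairs them by index.
    static Map<String, String> perSection(String sections, String values) {
        String[] names = sections.split(";");
        String[] parts = values.split(";");
        if (names.length != parts.length) {
            throw new IllegalArgumentException(
                    "Expected " + names.length + " values but found " + parts.length);
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            result.put(names[i].trim(), parts[i].trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String sections = "ipv4Route;ipv6Route";
        String headers = "Network Destination,Netmask,Gateway,Interface,Metric;"
                + "If,Metric,Network Destination,Gateway";
        perSection(sections, headers)
                .forEach((section, header) -> System.out.println(section + " -> " + header));
    }
}
```

Note that \= in cmnd3.result.footer is the escaped form of = in properties-file syntax, so each section's loaded footer value is a single = character.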

/etc/passwd

Executing cat /etc/passwd on a Linux system generates the following output:
[Image: output of cat /etc/passwd]

Now, let's understand which attributes of the Jilapi property file matter in this case.

  • <CMND_KEY>.parser.type: The data can be visualized as tabular data.
  • <CMND_KEY>.entity.end: Here, a complete meaningful entity, which is a user account entry, can be derived from a single line. Hence, the entity delimiter is a new line, which is also the default entity delimiter, and this attribute doesn't apply.
  • <CMND_KEY>.result.entity.field.delimiter: The entity field delimiter is a colon here. So, this attribute's value should be set to : in the property file.
  • <CMND_KEY>.result.sections: Not applicable here.
  • <CMND_KEY>.result.header: Not applicable here.
  • <CMND_KEY>.result.footer: Not applicable here.
  • <CMND_KEY>.result.ignore: Not applicable.
  • <CMND_KEY>.result.entity.field.positional.map: Here each field of the entity is positional. E.g. at the 1st position we find the user name, at the 2nd position the password and so on.
  • <CMND_KEY>.result.stop: N/A. No specific line where parsing should stop, as the required data has to be extracted till EOF.
  • <CMND_KEY>.result.entity.field.parser: N/A as command parser is tabular.
  • <CMND_KEY>.nested.hierarchy.id: N/A as command parser is tabular.
  • <CMND_KEY>.nested.hierarchy: N/A as command parser is tabular.

The corresponding property file entry is given below:

cmnd4.parser.type=tabular
cmnd4.result.entity.field.positional.map=1:userName,2:passwd,3:userId,4:grpId,5:userFullName,6:homeDirectory,7:shellAccount
cmnd4.result.entity.field.delimiter=:
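
A non-default field delimiter can be illustrated with plain String.split. The sketch below is not Jilapi code; DelimitedFieldsSketch and parseLine are hypothetical names, the field names are taken from the cmnd4 positional map, and the account line is a made-up sample with an empty full-name field.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration of splitting one line on a configured delimiter.
class DelimitedFieldsSketch {

    // Field names in positional order, as in the cmnd4 positional map.
    static final String[] FIELD_NAMES = {
        "userName", "passwd", "userId", "grpId", "userFullName", "homeDirectory", "shellAccount"
    };

    static Map<String, String> parseLine(String line, String delimiter) {
        // Pattern.quote treats the delimiter literally; limit -1 keeps empty fields.
        String[] values = line.split(Pattern.quote(delimiter), -1);
        Map<String, String> entity = new LinkedHashMap<>();
        for (int i = 0; i < FIELD_NAMES.length && i < values.length; i++) {
            entity.put(FIELD_NAMES[i], values[i]);
        }
        return entity;
    }

    public static void main(String[] args) {
        // Made-up account line with an empty full-name field.
        Map<String, String> user = parseLine("svc:x:1001:1001::/home/svc:/bin/sh", ":");
        user.forEach((field, value) -> System.out.println(field + " = '" + value + "'"));
    }
}
```

Keeping empty fields matters for /etc/passwd, where the full-name field is often blank; dropping it would shift every later field one position to the left.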

ifconfig -a

Executing ifconfig -a on a Linux system generates the following output:
[Image: output of ifconfig -a]

Now, let's understand which attributes of the Jilapi property file matter in this case.

  • <CMND_KEY>.parser.type: The data can be visualized as chunked data. Multiple lines in the output describe a single entity (interface).
  • <CMND_KEY>.entity.end: Here, a complete meaningful entity, which is an interface detail, is derived from multiple lines, with an empty line demarcating two entities (interfaces). Hence, the entity delimiter should be set to EMPTY_LINE.
  • <CMND_KEY>.result.entity.field.delimiter: The delimiter between multiple fields of an entity is SPACE. Hence, the default holds good.
  • <CMND_KEY>.result.sections: Not applicable here.
  • <CMND_KEY>.result.header: Not applicable here.
  • <CMND_KEY>.result.footer: Not applicable here.
  • <CMND_KEY>.result.ignore: Not applicable here.
  • <CMND_KEY>.result.entity.field.positional.map: Not applicable here.
  • <CMND_KEY>.result.stop: N/A. No specific line where parsing should stop, as the required data has to be extracted till EOF.
  • <CMND_KEY>.result.entity.field.parser: com.github.binitabharati.jilapi.entity.parser.impl.IfConfigParser. This parser decides how best to extract the data per entity (interface).
  • <CMND_KEY>.nested.hierarchy.id: N/A as command parser is chunked.
  • <CMND_KEY>.nested.hierarchy: N/A as command parser is chunked.

The corresponding property file entry is given below:

cmnd5.parser.type=chunked
cmnd5.entity.end=EMPTY_LINE
cmnd5.result.entity.field.parser=com.github.binitabharati.jilapi.entity.parser.impl.IfConfigParser
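
The effect of EMPTY_LINE as the entity delimiter can be sketched as grouping consecutive non-empty lines into chunks, one chunk per interface. This is an illustration only, not the ChunkedParser or IfConfigParser implementation; ChunkSplitSketch and chunks are hypothetical names, and the interface listing is a made-up sample.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of splitting output into chunks on empty lines.
class ChunkSplitSketch {

    // Groups consecutive non-empty lines; an empty line closes the current chunk.
    static List<List<String>> chunks(String output) {
        List<List<String>> result = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String line : output.split("\\r?\\n", -1)) {
            if (line.trim().isEmpty()) {
                if (!current.isEmpty()) {
                    result.add(current);
                    current = new ArrayList<>();
                }
            } else {
                current.add(line);
            }
        }
        if (!current.isEmpty()) {
            result.add(current); // last chunk when the output does not end with an empty line
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample in the classic ifconfig layout.
        String output = "eth0      Link encap:Ethernet  HWaddr 00:11:22:33:44:55\n"
                + "          inet addr:192.168.1.10  Bcast:192.168.1.255  Mask:255.255.255.0\n"
                + "\n"
                + "lo        Link encap:Local Loopback\n"
                + "          inet addr:127.0.0.1  Mask:255.0.0.0\n";
        for (List<String> chunk : chunks(output)) {
            System.out.println(chunk.size() + " lines, starting with: " + chunk.get(0).trim());
        }
    }
}
```

Each chunk would then be handed to the configured entity field parser, which decides how to pull fields out of the multi-line block.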

nested output

Consider the below nested output:
[Image: sample nested output]

Now, let's understand which attributes of the Jilapi property file matter in this case.

  • <CMND_KEY>.parser.type: The data can be visualized as nested data.
  • <CMND_KEY>.entity.end: Here, a complete meaningful entity, which is an element of the hierarchy, can be derived from a single line. Hence, the entity delimiter is a new line, which is also the default entity delimiter, and this attribute doesn't apply.
  • <CMND_KEY>.result.entity.field.delimiter: The delimiter between multiple fields of an entity is SPACE. Hence, the default holds good.
  • <CMND_KEY>.result.sections: Not applicable here.
  • <CMND_KEY>.result.header: Not applicable here.
  • <CMND_KEY>.result.footer: Not applicable here.
  • <CMND_KEY>.result.ignore: Not applicable here.
  • <CMND_KEY>.result.entity.field.positional.map: In this command output, there is nested tabular data for the element RAID Disk.
  • <CMND_KEY>.result.stop: In this command output, there are RAID Disk entries under the element Local spares too. But we do not want to consider those. So, parsing must stop when we encounter the String Local spares.
  • <CMND_KEY>.result.entity.field.parser: There are multiple entities involved here. Aggregate, Plex and RAID Group require their own implementations of com.github.binitabharati.jilapi.entity.parser.EntityParser, whereas RAID Disk represents tabular data and can be handled by the out-of-the-box com.github.binitabharati.jilapi.parser.impl.TabularParser.
  • <CMND_KEY>.nested.hierarchy.id: Defines a way to identify each element in the hierarchy. Must implement com.github.binitabharati.jilapi.parser.worker.NestedHierarchyIdentifier.
  • <CMND_KEY>.nested.hierarchy: Should define the hierarchy as per the guidelines.

The corresponding property file entry is given below:

cmnd6.parser.type=nested
cmnd6.nested.hierarchy.id=com.github.binitabharati.jilapi.parser.worker.impl.NestedHierarchyIdImpl
cmnd6.nested.hierarchy=%Aggregate%->[%Plex%->[%RAID group%->[%RAID Disk%]]]
cmnd6.result.entity.field.parser=Aggregate=com.github.binitabharati.jilapi.entity.parser.impl.AggregateParser;\
                                  Plex=com.github.binitabharati.jilapi.entity.parser.impl.PlexParser;\
                                  RAID group=com.github.binitabharati.jilapi.entity.parser.impl.RaidGroupParser;\
                                  RAID Disk=com.github.binitabharati.jilapi.parser.impl.TabularParser
cmnd6.result.entity.field.positional.map=RAID Disk=1:raidDisk,2:device,3:ha,4:shelf,5:bay,6:chan,7:pool,8:type,9:rpm,10:usedInMbPerBlocks,11:physicalInMbPerBlocks
cmnd6.result.footer=RAID Disk=EMPTY_LINE
cmnd6.result.ignore=RAID Disk=-
cmnd6.result.stop=Local spares
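
The difference a stop marker makes can be sketched in a few lines: once a line containing the marker is seen, nothing after it is read. Whether Jilapi matches the marker by substring or by some other rule is not stated here, so the substring match below is an assumption. StopMarkerSketch and linesBeforeStop are hypothetical names, and the input lines are placeholders rather than real command output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of a stop marker such as "Local spares".
class StopMarkerSketch {

    // Returns the lines that precede the first line containing the stop marker.
    static List<String> linesBeforeStop(List<String> lines, String stopMarker) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (line.contains(stopMarker)) {
                break; // nothing after the marker is parsed
            }
            kept.add(line);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Placeholder lines standing in for the nested output.
        List<String> lines = Arrays.asList(
                "Aggregate ...",
                "  Plex ...",
                "    RAID group ...",
                "Local spares",
                "  RAID Disk ... (spare, should not be parsed)");
        linesBeforeStop(lines, "Local spares").forEach(System.out::println);
    }
}
```

A footer, by contrast, only closes the current block of data (here the RAID Disk table ends at an empty line) and parsing carries on with the lines that follow.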

License

Copyright © 2016 Binita Bharati
Distributed under the Apache License 2.0.
