File access READ-FILE and WRITE-FILE bytes vs characters? #145

SirWumpus · 2023-03-31T12:39:52Z

SirWumpus
Mar 31, 2023

In Forth 2012 draft 19.1, the section 11 File Access words, such as READ-FILE and WRITE-FILE, are described in terms of reading and writing characters. I suspect the text was meant to talk in terms of bytes (octets), since there are no other words in section 11 to read or write bytes. Referring to "characters" could imply UTF-8 or ASCII (a UTF-8 subset), which involves additional considerations.

The READ-LINE and WRITE-LINE words also refer to characters instead of bytes. A buffer size counted in bytes can be very different from one counted in UTF-8 characters, and multibyte UTF-8 sequences need extra handling.

Given that the File Access words mirror similar functions of C stream I/O or POSIX file I/O, both of which operate with bytes in mind, should the terminology used by section 11 be rephrased? I suspect the reference to "characters" is legacy, but with UTF-8 support the distinction between characters and bytes needs to be clearer.

ruv · 2023-04-01T16:18:47Z

ruv
Apr 1, 2023
Maintainer

It was unexpected to me too that Forth-94 does not provide a portable way to read/write octets (not only in a file, but in memory too). You can only detect the character size and employ different custom implementations depending on that.
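As a rough illustration of "detect the character size", the sketch below counts how many bits survive a C! / C@ round trip. The word name char-bits is hypothetical, and PAD is used only as scratch space; this is one possible approach under those assumptions, not something the standard prescribes.

```forth
\ Sketch only: CHAR-BITS is a hypothetical name, not a standard word.
\ C! keeps only the low-order bits that fit in a character, and C@
\ zero-extends, so an all-ones cell shrinks to an all-ones character.
: char-bits ( -- n )
  -1 pad c!  pad c@        \ x = all-ones value of character width
  0 swap                   ( n x )
  begin dup while          \ loop until no set bits remain
    1 rshift  swap 1+ swap \ drop one bit, count it
  repeat drop ;
```

A program could then branch on the result at load time, for example with `char-bits 8 = [if] ... [else] ... [then]`.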

The characters were intentionally made independent from octets. Ditto for address units. An API to read/write octets is possible anyway, but it was not designed.

Forth-2012 does not provide any means for octet-oriented I/O either. A character size is at least one octet, and it may be more than one octet.

In the next version a character size is always one address unit, but an address unit can still be more than one octet. In some Forth implementations 1 chars = 1 cells = 1 address unit that is 4 octets (for example, jsForth).

If a program assumes that a character size is 1 octet, then this program has an environmental dependency.

> with UTF-8 support the distinction between characters and bytes needs to be more clear.

A notion of primitive character was introduced in Forth-2012, but there are still a number of lacunae and inconsistencies.

3 replies

ruv Apr 2, 2023
Maintainer

@SirWumpus wrote:

> should the terminology used by section 11 be rephrased?

It's impossible to rephrase only the terminology of section 11 consistently in such a way that file I/O becomes octet-oriented.

There are two other possible ways:

- Require an address unit to be exactly 1 octet (8 bits). At the moment, the number of bits in one address unit is implementation-defined and must be at least 8 bits (see sections 3.1.2 Character types and 4.1.1 Implementation-defined options).
- Design a separate API for octet-oriented I/O; it could be just a set of words in a separate word list.

Anyway, if a program needs octet-oriented I/O, the simplest way now is just to declare the corresponding environmental dependency. It can check this condition as:

-1 pad c! pad c@ 255 <> [if] .( This program needs an address unit to be 1 octet. Abort. ) abort [then]

SirWumpus Apr 2, 2023
Author

> Require an address unit to be exactly 1 octet

That would probably cause more issues throughout the draft.

> Design a separate API for octet-oriented I/O

Don't think a separate word set would be necessary, just add a couple of words to FILE EXT maybe: READ-BYTES-FILE and WRITE-BYTES-FILE.

Or maybe just use BIN mode since its definition sort of leaves it open to some interpretation:

    ... to additionally select a “binary”, i.e., not line oriented, file access method, ...

If it's not line-oriented, then "binary" could be interpreted as being octet-oriented. Some more rationale text explaining this might be all that is required. The current A.11.6.1.0765 BIN text could be interpreted that way.

ruv Apr 3, 2023
Maintainer

> Don't think a separate word set would be necessary, just add a couple of words to FILE EXT maybe: READ-BYTES-FILE and WRITE-BYTES-FILE.

According to the naming convention, they could be READ-FILE-BYTES and WRITE-FILE-BYTES (variants: READ-FILE-OCTETS, READ-FILE-OCTONARY, READ-FILE-OCTUPLY).

These words accept/return a length in octets. So, in some cases this length can be less than a character (and less than an address unit). Then, how can we process such a piece of data, which on one system is less than an address unit and on another system is greater than an address unit? I mean, a portable program should detect the address unit size and employ different algorithms depending on that. It should also solve the problem of alignment in some cases. It's cumbersome.

One alternative is to employ an octet-oriented API to access memory too (as I mentioned). It should also support reading/writing 16-bit, 32-bit, and 64-bit units, in a specified endianness, at any offset (calculated in octets), and it should not require alignment.

Another way is to unpack data from octets to characters (primitive characters) on reading from a file, and to pack from characters to octets on writing (higher bits are lost). Of course, such packing/unpacking actually takes place only on a system whose address unit is larger than 8 bits. One problem is that you cannot use @, !, w@, w!, etc. to access this data; only c@ and c! can be used.
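To make the "only c@ and c!" point concrete, here is a minimal sketch of accessing a 16-bit little-endian value when each character holds one unpacked octet. The names le-w@ and le-w! are hypothetical (they are not standard words and not part of any proposal in this thread); the sketch assumes only C@, C!, and CHAR+.

```forth
\ Sketch only: LE-W@ and LE-W! are hypothetical names.
\ Assumes the buffer holds one octet per character, low octet first.
: le-w@ ( c-addr -- u )
  dup c@                      \ low octet
  swap char+ c@ 8 lshift      \ high octet, shifted into place
  or ;

: le-w! ( u c-addr -- )
  over 255 and over c!        \ store low octet at c-addr
  swap 8 rshift 255 and       \ isolate high octet
  swap char+ c! ;             \ store it in the next character
```

Because the words step with CHAR+ rather than 1+, the same definitions should work whether a character occupies 8 bits or a wider address unit, and no alignment is required.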

> Or maybe just use BIN mode since its definition sort of leaves it open to some interpretation

Not sure. Maybe it's better to just introduce a new mode, e.g. BINO, to ensure backward compatibility.
Will it affect file-size, file-position, and reposition-file? Probably, in this mode they shall return a size/position in octets too.
How will it affect write-line, read-line, and include-file?

The only question: will at least some standard systems that have an address unit larger than 8 bits provide these features? I mean, a special mode of file opening, or additional words to read/write files, or an octet-oriented API to access memory?

If not, we don't need to bother with this at all. An environmental dependency "1 address unit = 1 octet" is enough.
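As a sketch of what "declaring the dependency" could look like in practice: once the program refuses to load on systems where a character is not 8 bits, character counts and octet counts coincide, so octet-named words can simply forward to the standard ones. The names read-file-bytes and write-file-bytes follow the naming suggested above and are hypothetical, not standard words.

```forth
\ Refuse to load unless a character holds exactly 8 bits
\ (the same style of check shown earlier in the thread).
-1 pad c!  pad c@ 255 <> [if]
  .( This program needs 1 character = 1 octet. ) abort
[then]

\ Hypothetical names. With the dependency declared, a length in
\ characters is also a length in octets, so these just forward.
: read-file-bytes  ( c-addr u1 fileid -- u2 ior )  read-file ;
: write-file-bytes ( c-addr u fileid -- ior )      write-file ;
```

Files would still be opened with BIN, for example `s" data.bin" r/o bin open-file`.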

SirWumpus · 2023-04-03T12:58:36Z

SirWumpus
Apr 3, 2023
Author

> ... READ-FILE-OCTONARY, READ-FILE-OCTUPLY).

I was half expecting to see READ-FILE-OCTOPUSSY (cue Bond music) in that list ;-)

Anyway, I see how cumbersome it can be to solve when the address unit is greater than 8 bits.

> Maybe better just introduce a new mode, e.g. BINO, to ensure backward compatibility.

I had considered a new mode too and it probably would be easiest; just figured BIN was vague enough to fill the gap with least change. This is probably only a concern when dealing with UTF-8 text files, which could be externally transcoded to UTF-16 or UTF-32. Then again a CPU with larger address units is probably special purpose and not going to work with common text files.

> ... will at least some standard systems, which have an address unit more than 8 bits, provide these features?

Guess that is the big question. Would anyone else see this as useful enough to implement?

> If not, we don't need to bother with this at all. An environmental dependency "1 address unit = 1 octet" is enough.

I see now how that makes sense.

@ruv Thank you for your insight and time.

0 replies
