forwardcom forum

Posted: **2018-08-25, 2:24:26**

Hi all – I thought this might interest you. It's a "vector-first" ISA: http://www.bitsnbites.eu/the-mrisc32-a- ... pu-design/

Posted: **2018-08-25, 5:03:06**

Thanks for the link. The MRISC32 ISA has some of the same ideas as ForwardCom. Maybe there is a basis for collaboration. I have added my comment at http://www.bitsnbites.eu/the-mrisc32-a- ... ent-219160

Posted: **2018-08-25, 17:47:05**

Hi,

Cool to see people taking interest in my project. While I agree that there are many similarities between ForwardCom and MRISC32, I think we have slightly different goals. Whereas ForwardCom is an attempt to design a robust, forward looking ISA, my own work with MRISC32 is mostly an attempt to learn about CPU architecture and getting to a point where I have a custom CPU that actually works (similar approach as the A2Z project: https://hackaday.io/project/18206-a2z-computer , but with a smaller scope). Since I have nothing to lose if my project doesn't succeed, I can afford to experiment with unconventional design choices. If anything, I hope that my project can inspire when it innovates, and teach when it fails.

With that said, I was intrigued by your suggestion of only setting the vector length when you load data into a register, and then use the register length for subsequent operations. It would be very elegant indeed. Initially I have a few concerns though, based on my experience with implementing the vector logic in the A1.

First off, in my design I need to know the vector operation length before accessing the vector register file. I have the vector register fetch in one pipeline stage, and I have the vector register element indexing logic in the pipeline stage before that. Especially for folding operations (which are a function of register fetch rather than instruction execution), I need to know the vector operation length before reading the register file.

The solution I have now is to keep a copy of the VL register in the instruction decode stage (before register fetch). I guess that the register length meta data could be held in a separate register file that is accessible to the vector indexing logic, but I think that it might be more complicated.

Another concern that I have is that you probably want to use the maximum length of the two source vector registers operands, instead of the length of first operand. That may require at least another pipeline stage for performing the max operation.

In any event - it's food for my thoughts. Thanks!

Posted: **2018-08-25, 20:07:51**

By the way, I'll ask you the same question I asked him. Is there any advantage to treating chars/strings as a basic type in an ISA, instead of treating them as integers? There should be fewer operations possible with them compared to integers – does that knowledge create any opportunities for optimization, either for a compiler or a microarchitecture?

Posted: **2018-08-26, 5:30:30**

mbitsnbites wrote:

in my design I need to know the vector operation length before accessing the vector register file. I have the vector register fetch in one pipeline stage, and I have the vector register element indexing logic in the pipeline stage before that.

I have solved that problem by requiring an explicit specification of the vector length in instructions that move data horizontally across vectors (because the latency may vary in large implementations). The vector length register is read in the address generation stage in the pipeline rather than in the execution stage. The vector space cannot be reconfigured in ForwardCom.

Another concern that I have is that you probably want to use the maximum length of the two source vector registers operands, instead of the length of first operand. That may require at least another pipeline stage for performing the max operation

The two vector operands will normally have the same length, except in the last iteration of a loop where an accumulator register may be longer than the last vector bit. The compiler or assembly programmer must remember to put the accumulator register as the first operand to avoid truncating it. The last vector register will be extended with zeroes to the same length. I cannot think of a case where a max operation is needed and the programmer doesn't know which one is longest.

JoeDuarte wrote:

Is there any advantage to treating chars/strings as a basic type in an ISA, instead of treating them as integers?

No, you can treat them as 8-bit integers if you use UTF-8 or as 16-bit integers for UTF-16. Anyway, human-readable text strings are rarely so long that execution time becomes critical.

Posted: **2018-08-26, 6:57:16**

Your solutions sound good. However I have one problem left, and that is that I really want to avoid having more than two source operands per instruction. The only exception in my current ISA is memory store, which takes data + base address + offset/stride, which I reccon is a reasonable exception because it's unlikely that you'd want to dispatch a multitude of memory store operations in paralllel in a superscalar implementation. Thus you can limit the number of read ports.

However, I guess that my VL register could be used for that purpose (for loads and folding operations), and have the register lengths control all other operands.

Btw, I see how the vector length will be automatically given in a loop with load + process + store. How about a pure generate + store loop?

Posted: **2018-08-26, 7:41:48**

agner wrote: ↑2018-08-26, 5:30:30 JoeDuarte wrote:
Is there any advantage to treating chars/strings as a basic type in an ISA, instead of treating them as integers?
No, you can treat them as 8-bit integers if you use UTF-8 or as 16-bit integers for UTF-16. Anyway, human-readable text strings are rarely so long that execution time becomes critical.

One thing that I think is taxing for most modern processors is searching for chars within strings, since you have a very long latency between data availability and branch evaluation. The most common problem is zero-terminated strings. With naive instructions this becomes a data dependent branched loop, where the content of every byte needs to be inspected, and it's impossible for the CPU to correctly predict the final branch. The situation is even worse if you have data dependent branch-based operations for every char in the string (e.g. convert to upper case, check for combinations of presence of alpha/numeric/whitespace, convert decimal ASCII to binary floating point etc).

From that point of view it may make sense to introduce some level of native string support in an ISA, but I think that it is sufficient to do it on an instruction level (there's no need to create a new data type). However, I would not dive into that rabbit hole without solid data and statistics of common string operations (a web browser / server core or an SQL database engine might be a good benchmark).

Posted: **2018-08-26, 9:27:26**

mbitsnbites wrote:

The most common problem is zero-terminated strings. With naive instructions this becomes a data dependent branched loop, where the content of every byte needs to be inspected, and it's impossible for the CPU to correctly predict the final branch.

This problem has already been solved. You read a full-length vector at a time and compare all the bytes to zero. The result of the compare is a boolean vector. Convert the boolean vector with n elements to an n-bit integer. Do a forward bit scan. This tells where the first zero is.

As this method reads beyond the end of the string, you have to avoid reading into non-existing memory space. This is done in x86 by aligning the vector boundary to a power-of-2 address to avoid crossing a page boundary. ForwardCom has no fixed-size memory pages. Instead I have specified that the OS must allocate an unused space of the same size as the maximum vector length after user memory space.

The situation is even worse if you have data dependent branch-based operations for every char in the string (e.g. convert to upper case, check for combinations of presence of alpha/numeric/whitespace, convert decimal ASCII to binary floating point etc).

A lot of these things can be done with masked vector operations. For example, my vector class library (x86) has functions for converting decimal ASCII to binary integer numbers (with fixed length).

The x86-SSE4.2 instructions can do more complex string operations such as searching for a substring. These instructions are expensive to implement i hardware, and they are rarely used because they require difficult assembly programming. I don't think these instructions are worth the cost because, as I said, human-readable strings are not so long that the performance becomes critical. The only application I can think of where SSE4.2 is important is DNA sequence analysis.

Posted: **2018-08-27, 10:59:48**

I have found that it can be beneficial to use SIMD for some text processing. Unfortunately, the SSE4.2 instructions are so slow that I find it can be better to use more elementary instructions instead.

Whilst the amount of text is probably not as great as other things, text processing is often highly inefficient relative to, for example, processing pixel data, as looping is often done byte-by-byte, and inside the loop is often quite branchy.
Also, there's a lot of protocols/formats out there which are text based; whilst parsing/serialization is often not a significant portion of overall CPU usage, introducing some SIMD can have a minor positive effect, in some cases, I find. Despite only having a minor effect, gains from CPU improvements are rapidly diminishing, so even just a few percent improvement is worth it, in my opinion.

I think the biggest problem is that few would bother using any hardware support for text processing if it existed, as agner mentions. I'm not sure what hardware could possibly do to accelerate text, but if SSE4.2 is anything to go by, it may not be worth the effort. Maybe some simplified version of the SSE4.2 instructions, which are easier to implement and more efficient in hardware? (e.g. substring search with shorter search string?)

Posted: **2018-08-28, 17:40:43**

For the string operations:
Code that deals with a a lot of strings is often very branchy and does complex addressing, even outside of string byte manipulations. This limits speed gains a lot.

Loops in that kind of code often have stuff like if(rareErrorCondition) {fprintf(stderr, "Error\n"); handleDeepException();} inside of the loop body, which defeats auto-vectorization. So you'd probably only end up with vectorized implementations for standard library text handling functions.

Posted: **2018-10-02, 6:18:10**

It would be interesting to make a "string machine" that is optimized for applications that are heavy on string processing. You would probably want new hardware backed string types, and you'd need compiler/language support.

For instance some applications (e.g. Firefox IIRC) use a neat trick: Register strings in a sorted array (I don't have the details in my head). The string handle then becomes a memory address that can be used directly for doing string comparisons (equal/greater than/less than) and copying with O(1) complexity. The costly part is updating the data structure with new strings. If that could be hardware accelerated and implemented as some sort of "string cache", you could probably make many programs an order of magnitude faster.

Posted: **2018-10-08, 1:38:57**

A string machine? How would you implement this string cache? Some kind of fast hardware hash function that processes 32 bytes at the time? Hardware accelerated UTF8 character loading and capitalization changes?

In particular, the hardware assisted string bank updating sounds really hard to build in hardware. It's kinda inherently a multi-cycle process, and it's going to be hard to save/restore it when switching contexts.

Posted: **2018-10-12, 6:20:57**

I don't know how it would be implemented. That's what makes it interesting

Like with most engineering tasks, you'll find a solution if you set your mind to it.

I think that I'd approach the problem by assuming that there are a few properties of a string that are very useful in common string operations (e.g. length, alphabetical order, etc), and that with a single string ID (e.g. a hash) that fits into a register you'd have O(1) access to these properties. Furthermore, raw string data (e.g. a UTF-8 byte buffer) will have to be turned quickly into this shorthand representation (exactly how that happens would have to be figured out, but you may want to do it asynchronously to avoid stalls).

Now for this to be at all useful you'd have to collect statistical data about string operations in real world applications to know what kind of string operations to optimize for, etc.

And maybe it's just a dead end.

Posted: **2018-10-28, 4:42:30**

Parabix is the most innovative project I've seen using hardware acceleration for string processing: http://parabix.costar.sfu.ca/

They also propose new instructions that would improve upon the SSE 4.2 and AVX-style approaches.

forwardcom forum

Interesting new ISA: MRISC32

Interesting new ISA: MRISC32

Re: Interesting new ISA: MRISC32

Re: Interesting new ISA: MRISC32

Re: Interesting new ISA: MRISC32

Re: Interesting new ISA: MRISC32

Re: Interesting new ISA: MRISC32

Re: Interesting new ISA: MRISC32

Re: Interesting new ISA: MRISC32

Re: Interesting new ISA: MRISC32

Re: Interesting new ISA: MRISC32

Re: Interesting new ISA: MRISC32

Re: Interesting new ISA: MRISC32

Re: Interesting new ISA: MRISC32

Re: Interesting new ISA: MRISC32