Different instruction sets on different cores

discussion of forwardcom instruction set and corresponding hardware and software

Moderator: agner

JoeDuarte
Posts: 41
Joined: 2017-12-19, 18:51:45

Different instruction sets on different cores

Post by JoeDuarte »

Hi Agner – Do we need every core to support the same registers and instructions? There is some evidence that a logarithmic number system would be more efficient than floating point for many workloads (https://en.wikipedia.org/wiki/Logarithmic_number_system). It would be nice to have floating point on two cores, and logarithmic on two other cores, for example.

And if ForwardCom were to have AES and other crypto instructions, it seems like it would be fine to have them on just one core. There's no need to have that on every core – they won't be used.

Separately, how would ForwardCom fare with strings compared to SSE 4.2? I don't see any comparable instructions.
agner
Site Admin
Posts: 192
Joined: 2017-10-15, 8:07:27
Contact:

Re: Different instruction sets on different cores

Post by agner »

A logarithmic number system is efficient as long as you are using it for multiplication only, but difficult if you want to do addition. You need no extra hardware for multiplying logarithmic numbers - this is simply addition of integers. Another possibility is to use standard floating point numbers and add the exponents. ForwardCom has an instruction mul_2pow that adds an integer n to the exponent of a floating point number. This corresponds to multiplying by 2^n, or dividing if n is negative. This does floating point multiplication at the speed of integer addition.
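
To illustrate both points in software (the LNS fixed-point format and helper names here are invented for the example, and std::ldexp only mimics in software what a mul_2pow-style instruction does in hardware):

Code:

#include <cmath>
#include <cstdint>
#include <cstdio>

// Toy LNS encoding: a value is stored as log2(x) in fixed point with 16
// fractional bits (format and names made up for this example).
using lns_t = int32_t;
constexpr int FRAC_BITS = 16;

lns_t  to_lns(double x)  { return (lns_t)std::lround(std::log2(x) * (1 << FRAC_BITS)); }
double from_lns(lns_t l) { return std::exp2((double)l / (1 << FRAC_BITS)); }

// LNS multiplication really is just integer addition of the representations.
lns_t lns_mul(lns_t a, lns_t b) { return a + b; }

int main() {
    std::printf("3 * 7 ~= %f\n", from_lns(lns_mul(to_lns(3.0), to_lns(7.0))));

    // Software analogue of a mul_2pow-style operation: add n to the exponent,
    // i.e. multiply by 2^n (divide if n is negative).
    std::printf("1.5 * 2^10 = %f\n", std::ldexp(1.5, 10));   // 1536
}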

I have not implemented something like Intel's SSE4.2 instructions for the following reasons:
  • These instructions are used mainly for manipulating human-readable text. Such texts are usually so short that execution time is negligible; only applications such as DNA analysis are speed-critical.
  • I don't want complicated instructions that need to be split up into micro-operations. This makes the whole pipeline more complicated and slower.
  • SSE4.2 is rarely used because it doesn't easily integrate into high level programming languages.
  • You can have an FPGA for application-specific instructions. This can be used for SSE4.2-like operations, cryptographic instructions, etc.
-.-
Posts: 5
Joined: 2017-12-24, 5:10:47

Re: Different instruction sets on different cores

Post by -.- »

JoeDuarte wrote: 2017-12-19, 19:00:38 And if ForwardCom were to have AES and other crypto instructions, it seems like it would be fine to have them on just one core. There's no need to have that on every core – they won't be used.
I would've thought that a very common application of crypto acceleration would be a multi-threaded HTTPS/VPN/etc server, where the acceleration units would need to be on each core to be used. You could just lock the server to one core, but then you'll be unable to use the other cores on the chip. Alternatively, you could have a process/thread running on the "crypto core" and pass data back and forth between the server's worker threads and the crypto thread, but that'd complicate the programming model a little (not too sure how much of a performance penalty this is) - still, it'd work I suppose.
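
To sketch what that hand-off might look like (all names here - CryptoJob, crypto_thread_main, encrypt_on_crypto_core - are hypothetical, and the "encryption" is a placeholder XOR just to keep the example self-contained):

Code:

#include <condition_variable>
#include <cstdio>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// One job: a buffer to encrypt plus a promise the submitting worker waits on.
struct CryptoJob {
    std::vector<unsigned char> data;
    std::promise<std::vector<unsigned char>> done;
};

std::queue<CryptoJob> g_jobs;
std::mutex g_mtx;
std::condition_variable g_cv;
bool g_stop = false;

// Runs pinned to the hypothetical "crypto core" and drains the queue.
void crypto_thread_main() {
    for (;;) {
        std::unique_lock<std::mutex> lk(g_mtx);
        g_cv.wait(lk, [] { return g_stop || !g_jobs.empty(); });
        if (g_stop && g_jobs.empty()) return;
        CryptoJob job = std::move(g_jobs.front());
        g_jobs.pop();
        lk.unlock();

        // Placeholder "encryption" (XOR); the real thing would use whatever
        // accelerated AES this core provides.
        for (unsigned char& b : job.data) b ^= 0xAA;
        job.done.set_value(std::move(job.data));
    }
}

// Called from any worker thread; blocks until the crypto thread is done,
// which is where the extra cross-core round trip shows up as latency.
std::vector<unsigned char> encrypt_on_crypto_core(std::vector<unsigned char> buf) {
    CryptoJob job{std::move(buf), {}};
    auto result = job.done.get_future();
    {
        std::lock_guard<std::mutex> lk(g_mtx);
        g_jobs.push(std::move(job));
    }
    g_cv.notify_one();
    return result.get();
}

int main() {
    std::thread crypto(crypto_thread_main);
    auto out = encrypt_on_crypto_core({'h', 'i'});
    std::printf("%zu bytes encrypted\n", out.size());
    { std::lock_guard<std::mutex> lk(g_mtx); g_stop = true; }
    g_cv.notify_one();
    crypto.join();
}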

It's interesting to note that Intel has announced the AVX512 VAES extension for upcoming Icelake processors, which can encrypt 4 streams in parallel. I don't know what purpose this is aimed at, but clearly they see a benefit for enabling more parallel encryption (or maybe it helps accelerate a single stream AES-CTR, though it being released along with VPCLMUL seems to suggest 4 parallel AES-GCM streams being the aim).
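
For what it's worth, the 4-wide form is exposed to C roughly like this (assuming a compiler and CPU with AVX-512F plus VAES; the key schedule and the rest of the CTR/GCM machinery are omitted):

Code:

#include <immintrin.h>

// One AES encryption round applied to four independent 128-bit blocks at once;
// a full AES-128 encryption would run several such rounds plus a final
// _mm512_aesenclast_epi128. Compile with e.g. -mavx512f -mvaes (GCC/Clang).
__m512i aes_round_x4(__m512i four_blocks, __m512i four_round_keys) {
    return _mm512_aesenc_epi128(four_blocks, four_round_keys);
}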

I've never done any work with FPGAs so I can't comment on how it'd compare with a "dedicated" crypto core.
Kulasko
Posts: 32
Joined: 2017-11-14, 21:41:53
Location: Germany

Re: Different instruction sets on different cores

Post by Kulasko »

-.- wrote: 2017-12-24, 5:28:48 It's interesting to note that Intel has announced the AVX512 VAES extension for upcoming Icelake processors, which can encrypt 4 streams in parallel. I don't know what purpose this is aimed at, but clearly they see a benefit for enabling more parallel encryption (or maybe it helps accelerate a single stream AES-CTR, though it being released along with VPCLMUL seems to suggest 4 parallel AES-GCM streams being the aim).

I've never done any work with FPGAs so cannot comment how it'd compare with a "dedicated" crypto core.
FPGA implementations have a few drawbacks compared to ASIC implementations, the most notable perhaps being the attainable clock rate of a given block of logic (typically a few hundred MHz today), so you will see higher latency. However, the ForwardCom ISA should cover the vast majority of latency-sensitive algorithms, as it describes a general-purpose processor. For throughput-sensitive algorithms, you can usually just increase parallelism. In theory, you can design a wider FPGA implementation with a higher total throughput than a narrower ASIC implementation.

A current idea for ForwardCom is to integrate FPGAs in CPU cores; the current specification version has reserved instruction codes for this purpose. It should be possible to supply a library for the FPGA programming (by the operating system?) and then use these designs as one would use regular instruction extensions in other architectures. Of course, the supplied algorithm has to exploit enough parallelism and the program has to tell the operating system what algorithm it wants to run.
-.-
Posts: 5
Joined: 2017-12-24, 5:10:47

Re: Different instruction sets on different cores

Post by -.- »

I'd imagine that mostly serial encryption, such as AES-CBC, would suffer, speed-wise, on an FPGA compared to a CPU with dedicated AES instructions, though mostly parallel methods like AES-CTR could be better (for large enough amounts of data).
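
A schematic of that difference (encrypt_block here is a dummy stand-in, not a real cipher - the point is only the data dependency):

Code:

#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

using Block = std::array<uint8_t, 16>;

// Dummy stand-in for AES block encryption -- NOT a real cipher, just enough
// to make the sketch self-contained.
Block encrypt_block(Block in, const Block& key) {
    for (std::size_t i = 0; i < in.size(); ++i) in[i] = uint8_t(in[i] ^ key[i] ^ 0x5A);
    return in;
}

Block xor_blocks(Block a, const Block& b) {
    for (std::size_t i = 0; i < a.size(); ++i) a[i] ^= b[i];
    return a;
}

// CBC: block i needs ciphertext i-1 first, so the loop is inherently serial.
void cbc_encrypt(std::vector<Block>& blocks, const Block& key, Block iv) {
    for (Block& blk : blocks) {
        blk = encrypt_block(xor_blocks(blk, iv), key);
        iv = blk;                                   // carried dependency
    }
}

// CTR: each block only needs its own counter, so every iteration is
// independent and can run in parallel on wide vectors or an FPGA.
void ctr_encrypt(std::vector<Block>& blocks, const Block& key, uint8_t nonce) {
    for (std::size_t i = 0; i < blocks.size(); ++i) {
        Block ctr{};                                // nonce || counter, schematically
        ctr[0]  = nonce;                            // (real CTR formatting omitted)
        ctr[15] = uint8_t(i);
        blocks[i] = xor_blocks(blocks[i], encrypt_block(ctr, key));
    }
}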

I haven't really looked at what ForwardCom provides though, so maybe it has other mitigations in place.
JoeDuarte
Posts: 41
Joined: 2017-12-19, 18:51:45

Re: Different instruction sets on different cores

Post by JoeDuarte »

-.- wrote: 2017-12-24, 5:28:48
JoeDuarte wrote: 2017-12-19, 19:00:38 And if ForwardCom were to have AES and other crypto instructions, it seems like it would be fine to have them on just one core. There's no need to have that on every core – they won't be used.
I would've thought that a very common application of crypto acceleration would be a multi-threaded HTTPS/VPN/etc server, where the acceleration units would need to be on each core to be used. You could just lock the server to one core, but then you'll be unable to use the other cores on the chip. Alternatively, you could have a process/thread running on the "crypto core" and pass data back and forth between the server's worker threads and the crypto thread, but that'd complicate the programming model a little (not too sure how much of a performance penalty this is) - still, it'd work I suppose.
I've never done any work with FPGAs so cannot comment how it'd compare with a "dedicated" crypto core.
You're right about the server use case. I was thinking of clients – desktop and mobile – where the crypto load is trivial (one webpage every minute?). Though ideally the disk/file system should be encrypted by default, and I'm not sure what kind of load that would generate.

I think it's suboptimal to have the same ISA for servers and clients, and to have such a vast number of instructions supported by every core. I'm not even sure that most programs use floating point math, for instance, so it's odd to have so much FP hardware and instruction support on every core. I don't think having a different ISA for servers vs. clients would be much trouble for developers, since developing applications for mobile and desktop is already quite different from server development and most developers don't interact with the ISA directly. 64-bit is such a waste on phones and desktops – I think something like 40-bit registers and address space would be optimal for client devices.

Agner seems to think that we can dish a bunch of work to FPGAs, like encryption. But I think it's very unlikely that OEMs will want to include FPGAs in most devices. Maybe he's just thinking of servers, which is more feasible, but even then I don't know that FPGAs will ever be common. An FPGA is going to add to the BOM and cost, and I doubt many customers will be clamoring for them. Very few developers have any experience with FPGAs, and they seem to be niche devices for things like high frequency trading. It's hard enough to get developers to use modern CPU instructions, vectorization, etc.
JoeDuarte
Posts: 41
Joined: 2017-12-19, 18:51:45

Re: Different instruction sets on different cores

Post by JoeDuarte »

agner wrote: 2017-12-20, 17:55:30 A logarithmic number system is efficient as long as you are using it for multiplication only, but difficult if you want to do addition. You need no extra hardware for multiplying logarithmic numbers - this is simply addition of integers. Another possibility is to use standard floating point numbers and add the exponents. ForwardCom has an instruction mul_2pow that adds an integer n to the exponent of a floating point number. This corresponds to multiplying by 2^n, or dividing if n is negative. This does floating point multiplication at the speed of integer addition.
Agner, I got the impression that a logarithmic number system benefits greatly from a hardware implementation, like the European Logarithmic Microprocessor. For example:

https://www.ece.ucsb.edu/~parhami/pubs_ ... to-flp.pdf

http://ieeexplore.ieee.org/document/715 ... eload=true

If it's just integer addition, what are these hardware implementations implementing?
agner
Site Admin
Posts: 192
Joined: 2017-10-15, 8:07:27
Contact:

Re: Different instruction sets on different cores

Post by agner »

Joe,
In a logarithmic number system, multiplication and division become simpler, but addition and subtraction become much more complicated. Your links confirm this. A program with an equal number of additions and multiplications will be faster on a floating point computer than on a logarithmic processor. Precision is also an issue. An integer can be expressed exactly in a floating point system, but not in a logarithmic system.
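
Concretely, if x = log2(a) and y = log2(b) with x >= y, then log2(a*b) is just x + y, but log2(a+b) = x + log2(1 + 2^(y-x)), and it is that correction term that logarithmic hardware has to approximate with tables or interpolation. A small sketch (doubles used for clarity; real LNS hardware stores the logs in a fixed-point format):

Code:

#include <algorithm>
#include <cmath>

// Values are represented by their base-2 logarithms.
double lns_mul(double x, double y) { return x + y; }   // a*b: trivial

double lns_add(double x, double y) {                   // a+b: the hard part
    if (x < y) std::swap(x, y);
    // Correction term log2(1 + 2^(y-x)); hardware needs a table/approximation.
    return x + std::log2(1.0 + std::exp2(y - x));
}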
Kulasko
Posts: 32
Joined: 2017-11-14, 21:41:53
Location: Germany

Re: Different instruction sets on different cores

Post by Kulasko »

JoeDuarte wrote: 2018-01-22, 2:36:11 You're right about the server use case. I was thinking of clients – desktop and mobile – where the crypto load is trivial (one webpage every minute?). Though ideally the disk/file system should be encrypted by default, and I'm not sure what kind of load that would generate.

I think it's suboptimal to have the same ISA for servers and clients, and to have such a vast number of instructions supported by every core. I'm not even sure that most programs use floating point math, for instance, so it's odd to have so much FP hardware and instruction support on every core. I don't think having a different ISA for servers vs. clients would be much trouble for developers, since developing applications for mobile and desktop is already quite different from server development and most developers don't interact with the ISA directly. 64-bit is such a waste on phones and desktops – I think something like 40-bit registers and address space would be optimal for client devices.
What you are aiming at is basically a client-specialized architecture. That might be optimal from a pure space/power efficiency point of view; however, it would need twice the development work (OS, compilers etc. need to be developed too), it would be binary incompatible with all other kinds of devices, and in the case of your 40-bit proposal, it could introduce nasty bugs as developers are accustomed to 32- or 64-bit integer and floating point numbers. Also, it might run into the same addressing wall we ran into with 32 bit in the early 2000s if it survives for a decade or more.
An ISA is a mere specification; you can vary a lot of things through implementation. For example, you could build a ForwardCom processor with 128-bit vectors for clients, one with 256 bits for servers, and one with 8192 bits for scientific computing. In that regard, ForwardCom allows very high flexibility. Also, there is no need to implement all instructions efficiently if they are rarely used in the environment you design your processor for. A good part of the more advanced instructions in ForwardCom are even fully optional.
JoeDuarte wrote: 2018-01-22, 2:36:11 Agner seems to think that we can dish a bunch of work to FPGAs, like encryption. But I think it's very unlikely that OEMs will want to include FPGAs in most devices. Maybe he's just thinking of servers, which is more feasible, but even then I don't know that FPGAs will ever be common. An FPGA is going to add to the BOM and cost, and I doubt many customers will be clamoring for them. Very few developers have any experience with FPGAs, and they seem to be niche devices for things like high frequency trading. It's hard enough to get developers to use modern CPU instructions, vectorization, etc.
The current ForwardCom proposal integrates an FPGA in every CPU core. It would be possible for the OS to supply FPGA programs for different instruction extensions, so a programmer could use them as if they were a native part of the ISA. However, the speed disadvantage versus a native implementation will remain and might be a critical problem in edge cases. In those cases, an ISA extension might be unavoidable.
-.-
Posts: 5
Joined: 2017-12-24, 5:10:47

Re: Different instruction sets on different cores

Post by -.- »

JoeDuarte wrote: 2018-01-22, 2:36:11 I was thinking of clients – desktop and mobile – where the crypto load is trivial (one webpage every minute?). Though ideally the disk/file system should be encrypted by default, and I'm not sure what kind of load that would generate.
True that most clients won't need that much crypto. Could increase a bit in the future (DRM, HTTPS, encrypted communications, I/O etc), though it's still likely not that much. Disks are usually slow enough that they don't have much of an impact on CPU, but with faster storage becoming readily available, this can change too. Modern SSDs often have built-in encryption (self-encrypting drives or SEDs), but there can be trust issues with using those (e.g. often insecurely implemented by the manufacturer).
JoeDuarte wrote: 2018-01-22, 2:36:11 I think it's suboptimal to have the same ISA for servers and clients
It is indeed more optimal to target your chips for the applications running on them, but I agree with Kulasko that there's also a cost to having different ISAs between client/server. There's a reason why the overwhelming majority of servers run x86...
JoeDuarte wrote: 2018-01-22, 2:36:11 I'm not even sure that most programs use floating point math, for instance, so it's odd to have so much FP hardware and instruction support on every core
I don't really write programs which use much FP math, but off the top of my head, I'd imagine that games would make heavy use of FP, along with some media content creation applications, and possibly even web page rendering. A number of scripting languages, such as Javascript, exclusively use 64-bit floats for their number representation (though JIT engines may be able to optimise these into ints).
JoeDuarte wrote: 2018-01-22, 2:36:11 since developing applications for mobile and desktop is already quite different from server development
Funnily enough, node.js is really hot in the server-side web application development space at the moment - one of its key selling points being that it uses the same language as that used in the browser, and hence, libraries can be shared across the two. (though web devs often change their technology stack every few years, so this may not last)
Unrelated to ISA details, but I just felt like pointing it out anyway. I do agree that development work for clients and servers is generally quite different, not to mention that desktop/mobile often has an x86/ARM split.
JoeDuarte wrote: 2018-01-22, 2:36:11 64-bit is such a waste on phones and desktops – I think something like 40-bit registers and address space would be optimal for client devices
40-bit does sound interesting, and I can't really disagree with not needing any more, but I think in this day and age, not having 64-bit support may be a hard sell. A bunch of code and algorithms (e.g. off the top of my head, SHA512) are already optimised to make use of 64-bit CPUs.
1TB of addressable memory does sound like a potential limitation though. It's not unusual for servers to have this much memory these days, and it's likely it'll become common in clients in the future. Also, if non-volatile memory storage solutions become popular in the future, and OSes see a benefit in mapping disk into RAM, 1TB would definitely be limiting.
HubertLamontagne
Posts: 80
Joined: 2017-11-17, 21:39:51

Re: Different instruction sets on different cores

Post by HubertLamontagne »

JoeDuarte wrote:I'm not even sure that most programs use floating point math, for instance, so it's odd to have so much FP hardware and instruction support on every core.
Video games use TONS of floating point math. Everything that is a 3D coordinate is going to have 32-bit floating point XYZ. If your scene has 10 million vertices, that's 30 million floating point values just for the coordinates (plus another 30 million for the normals) that need to be processed each frame (= 60 times per second). The whole point of the PS2's "Emotion engine" (customized MIPS + 2 high-bandwidth 4x32-bit vector units) was to do as many float multiplies and adds as possible. The whole point of the PS3's infamous "Cell" processor was also to do as many float multiplies and adds as possible. The iPhone was pretty much the first cell phone with an FPU, and that's exactly when 3D games on cell phones exploded.
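
(Back-of-envelope on those numbers, with a minimal transform loop for concreteness: 10 million positions through a 4x4 transform is roughly 90M multiplies plus 90M adds per frame, so around 11 GFLOP/s at 60 fps before normals, lighting, skinning or physics. The struct names below are just for the example.)

Code:

#include <vector>

struct Vec3 { float x, y, z; };
struct Mat4 { float m[4][4]; };   // row-major affine transform

// Transform every vertex position once per frame: 9 multiplies + 9 adds each.
void transform_all(std::vector<Vec3>& verts, const Mat4& t) {
    for (Vec3& v : verts) {
        const float x = v.x, y = v.y, z = v.z;   // w = 1 implied
        v.x = t.m[0][0]*x + t.m[0][1]*y + t.m[0][2]*z + t.m[0][3];
        v.y = t.m[1][0]*x + t.m[1][1]*y + t.m[1][2]*z + t.m[1][3];
        v.z = t.m[2][0]*x + t.m[2][1]*y + t.m[2][2]*z + t.m[2][3];
    }
}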
agner
Site Admin
Posts: 192
Joined: 2017-10-15, 8:07:27
Contact:

Re: Different instruction sets on different cores

Post by agner »

Hubert wrote:
Video games use TONS of floating point math
ForwardCom has optional support for half precision floating point vectors. Do you think that video and sound applications can use half precision? Neural networks are another application for half precision.

The operand type field in the instruction template has 3 bits giving 2^3 = 8 types: int8, int16, int32, int64, int128, float32, float64, float128.
As you see, I have given priority to possible 128-bit extensions so there is no space for float16. Instead, half precision instructions are implemented as single-format instructions without memory operand. You need to use int16 instructions for memory read and write.
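
(Restated as a small table; the numbering below simply follows the order listed above, and the actual code assignments in the spec may differ:)

Code:

// 3-bit operand type (OT) field: 2^3 = 8 values. Numbering here just follows
// the order listed above; float16 has no code of its own and is instead
// handled by single-format instructions, using int16 loads/stores for memory.
enum class OperandType : unsigned {
    Int8, Int16, Int32, Int64, Int128, Float32, Float64, Float128
};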
JoeDuarte
Posts: 41
Joined: 2017-12-19, 18:51:45

Re: Different instruction sets on different cores

Post by JoeDuarte »

HubertLamontagne wrote: 2018-02-01, 16:39:33
JoeDuarte wrote:I'm not even sure that most programs use floating point math, for instance, so it's odd to have so much FP hardware and instruction support on every core.
Video games use TONS of floating point math. Everything that is a 3d coordinate is going to have 32bit floating point XYZ. If your scene has 10 million vertexes, that's 30 million floating point values just for the coordinates (plus another 30 million for the normals), that need to be processed each frame (= 60 times per second). The whole point of the PS2's "Emotion engine" (customized MIPS + 2 high bandwidth 4x32bit vector units) was to do as many float multiplies and adds as possible. The whole point of the PS3's infamous "Cell" processor was also to do as many float multiplies and adds as possible. The Iphone was pretty much the first cell phone with an FPU and that's exactly when 3d games on cell phones exploded.
Would games be better off with logarithmic number system hardware instead of FP?
JoeDuarte
Posts: 41
Joined: 2017-12-19, 18:51:45

Re: Different instruction sets on different cores

Post by JoeDuarte »

JoeDuarte wrote: 2018-01-22, 2:36:11 64-bit is such a waste on phones and desktops – I think something like 40-bit registers and address space would be optimal for client devices
-.- wrote: 40-bit does sound interesting, and I can't really disagree with not needing any more, but I think in this day and age, not having 64-bit support may be a hard sell. A bunch of code and algorithms (e.g. off the top of my head, SHA512) are already optimised to make use of 64-bit CPUs.
1TB of addressable memory does sound like a potential limitation though. It's not unusual for servers to have this much memory these days, and it's likely it'll become common in clients in the future. Also, if non-volatile memory storage solutions become popular in the future, and OSes see a benefit in mapping disk into RAM, 1TB would definitely be limiting.


I was only thinking of 1 TB for clients, not servers. You might have noticed that RAM on desktops, laptops, and smartphones has virtually hit a wall. There's very little growth at this point. The iPhone has been stuck at 2 GB for years. Premium Android devices usually sport 4-6 GB. High-end laptops are still coming with 8, 12, or 16 GB (some have a theoretical max of 32 GB, and some, like Apple's useless port-free laptops, are capped at 16 GB).

I realize that Bill Gates made that infamous comment about how we'd never need more than 640 KB of RAM or something, but the fact that he was way off doesn't mean that there isn't actually a number that we'll never need to surpass. It looks like it will be many years before 64 GB is normal in a desktop/laptop, and I don't think it will ever be normal on mobile (unless we're talking the year 2100).

1 TB will be far more than enough for clients for several decades. The only thing I wonder about is how tagged memory will work with a 40-bit address space. Would it require more bits? I've been fascinated by the CHERI CPU project and ISA: http://www.cl.cam.ac.uk/research/security/ctsrd/cheri/

And the TAXI proposal: https://people.csail.mit.edu/hes/ROP/Pu ... thesis.pdf
HubertLamontagne
Posts: 80
Joined: 2017-11-17, 21:39:51

Re: Different instruction sets on different cores

Post by HubertLamontagne »

I've never seen half-float being used. Not in game code, and not in sound applications (where 32bit float is very much the sweet spot). There's very little x86 support - only AVX conversion instructions to and from 32bit float vectors (vcvtps2ph and vcvtph2ps). There is no standard C/C++ type name for it either (the only trace of half float on x86 is the AVX conversion intrinsics).
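
(For reference, those two instructions are reachable from C via the F16C intrinsics - compile with e.g. -mf16c - with the half floats travelling as raw 16-bit words in a __m128i, since there's no standard C/C++ half type:)

Code:

#include <immintrin.h>
#include <cstdio>

int main() {
    // 8 floats -> 8 half floats (vcvtps2ph), stored as raw 16-bit words.
    __m256  f = _mm256_set1_ps(3.14159f);
    __m128i h = _mm256_cvtps_ph(f, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);

    // ...and back up to 8 floats (vcvtph2ps).
    __m256 back = _mm256_cvtph_ps(h);

    float out[8];
    _mm256_storeu_ps(out, back);
    std::printf("%f\n", out[0]);   // ~3.140625: half precision has rounded it
}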
JoeDuarte
Posts: 41
Joined: 2017-12-19, 18:51:45

Re: Different instruction sets on different cores

Post by JoeDuarte »

HubertLamontagne wrote: 2018-02-02, 18:11:06 I've never seen half-float being used. Not in game code, and not in sound applications (where 32bit float is very much the sweet spot). There's very little x86 support - only AVX conversion instructions to and from 32bit float vectors (vcvtps2ph and vcvtph2ps). There is no standard C/C++ type name for it either (the only trace of half float on x86 is the AVX conversion intrinsics).
Hi Hubert, I've lost the plot here a bit. Why are you talking about half-float? Is this related to my enthusiasm for 40-bit registers and address spaces for client devices? How so?

In any case, half-float, by which I assume you mean 16-bit FP, is extremely relevant right now, much more so than it was even ten years ago. It features prominently in a lot of deep learning APIs and platforms, most recently in NVIDIA's new Volta "GPU" architecture with its plethora of dedicated tensor cores (I put "GPU" in quotes because this product is no more a GPU than my rooftop antenna – it's meant exclusively for data centers, particularly for deep learning applications. Perhaps one day the Volta architecture will be spun into a GPU, and one can even dream that cryptocurrency miners won't make it impossible to actually buy these "GPUs" for ≤ 120% of their MSRP.)

Some interesting expansions on the 16-bit FP renaissance:

https://devblogs.nvidia.com/mixed-preci ... ng-cuda-8/

Facebook's Caffe2 platform: https://caffe2.ai/blog/2017/05/10/caffe ... pport.html

Deep dive into Volta: https://devblogs.nvidia.com/inside-volta/

With my proposed 40-bit platform, I imagine specifying 20, 40, and 80-bit integers and FP. I think 20-bit integers and floats would be more useful in many cases than 16-bit. And the 80-bit floats perfectly sync up with the 80-bit Extended Precision FP that IEEE sort of documents already. I think Intel uses 80-bit floats when doing math on doubles. The 20, 40, and 80 bit floats would have to be very rigorously specified, much like the recent IEEE specs (but it should be free and open source, not cost an arm and a leg like the IEEE standards or the C++ standard).

There's also the new ISO/IEC standard which is much broader than floating point: https://en.wikipedia.org/wiki/ISO/IEC_10967

And I'd want a logarithmic number system IF the requisite empirical research tells us that it would be a significant benefit for many programs. (And yes, we'd have to sort out what we mean by "significant" and "many" and so forth.)

I assume a 20/40/80-bit platform could easily support legacy 16/32/64-bit types by padding or other means.

I also like the idea of 320-bit vector registers. 8 40-bit values. 10 32-bit values. 4 80-bit. From what I've read, I'm not sure that huge vectors of the sort Agner wants are efficient. Isn't AVX-512 underperforming right now?

Finally, I think core type bit lengths, register sizes, address space, vector length, etc. should all be chosen by rigorous empirical research on what is optimal for the kind of operating system we want (and we really should want new, clean-sheet OSes), and the applications we expect to run on them. My 20/40/80 business is really just a hunch of near optimality for client devices. But the optimal values could be quite different, and innovations in semiconductor manufacturing and hardware design could enable a whole new set of optimal parameters.
-.-
Posts: 5
Joined: 2017-12-24, 5:10:47

Re: Different instruction sets on different cores

Post by -.- »

JoeDuarte wrote: 2018-02-02, 13:38:18 You might have noticed that RAM on desktops, laptops, and smartphones has virtually hit a wall. There's very little growth at this point. The iPhone has been stuck at 2 GB for years. Premium Android devices usually sport 4-6 GB. High-end laptops are still coming with 8, 12, or 16 GB (some have a theoretical max of 32 GB, and some, like Apple's useless port-free laptops, are capped at 16 GB).
This does seem to be the case. I'd say there's not really a need for more RAM on most client machines. The other factor is likely RAM pricing in recent times due to supply shortages.
The iPhone probably has its own reasons for its limitations; also, Intel limits its client CPUs to 32-64GB RAM, presumably to stop people running servers on them. Hence, these may also be factors.

However, I suspect this "wall" is mostly just an economic one, not a technical one. 128GB DIMMs are available now, and consumer motherboards with 4-8 DIMM slots are not uncommon. It's hard to guess economic conditions 10-20 years down the track, and I don't think an ISA should be making such heavy bets about it.
JoeDuarte wrote: 2018-02-02, 13:38:18 It looks like it will be many years before 64 GB is normal in a desktop/laptop
I'd agree with that, but I think an ISA should first consider the requirements of its high-end users (if it can serve them, it'll also work for common users). I do know a few workstations (for multimedia processing) which have 64GB RAM installed.

My personal home computer has 32GB RAM installed. This guy's personal desktop has 128GB RAM installed [ https://jmvalin.dreamwidth.org/15583.html ].
HubertLamontagne
Posts: 80
Joined: 2017-11-17, 21:39:51

Re: Different instruction sets on different cores

Post by HubertLamontagne »

Anyhow, the most common customizations you see on CPUs are for making smaller embedded versions. The most common configurations you see in the wild:

- 32bit, no MMU, no FPU, no SIMD: fast micro-controller. Truckloads of ARM and MIPS chips use this, such as the STM32s which are taking over the hardware world.
- 32bit, FPU: fast micro-controller, very useful if you need lightweight systems that do DSP processing (ex: guitar effect pedals) (larger STM32's).
- 32bit, MMU: small cpu that runs complex OS's such as Linux. Lots of first generation Android phones used this, plus routers etc.
- 32bit, MMU+FPU: the iPhone configuration. Does both complex OS and DSP. Classic configuration that has broad applicability to lots of software.
- 64bit, SIMD and virtualization support are added to the 32bit+MMU+FPU config: 64bit to address large amounts of RAM, SIMD to boost DSP performance
JoeDuarte
Posts: 41
Joined: 2017-12-19, 18:51:45

Re: Different instruction sets on different cores

Post by JoeDuarte »

agner wrote: 2018-02-02, 6:57:27 Hubert wrote:
Video games use TONS of floating point math
ForwardCom has optional support for half precision floating point vectors. Do you think that video and sound applications can use half precision? Neural networks is another application for half precision.

The operand type field in the instruction template has 3 bits giving 2^3 = 8 types: int8, int16, int32, int64, int128, float32, float64, float128.
As you see, I have given priority to possible 128-bit extensions so there is no space for float16. Instead, half precision instructions are implemented as single-format instructions without memory operand. You need to use int16 instructions for memory read and write.
Hi Agner, I think 16-bit FP has actually become more popular in recent years, so not treating it as a first-class citizen may be a mistake. It's not just popular in games, but in deep learning applications and imaging. GPU makers have intensified their support for it lately, and NVIDIA's new tensor cores center on it. Google's TensorFlow ASICs also depend on it, I think. Apparently 16-bit is optimal for deep learning because it offers the right compromise of precision and speed. Now, you could say all this stuff can be handled by GPUs, not a CPU instruction set, but there's evidence that 16-bit will be used a lot by CPUs, like the introduction of the F16C instructions for conversion, and the fact that 16-bit FP is used in some imaging formats for High Dynamic Range. Imaging won't always be offloaded to GPU – in fact, right now it rarely is on desktop platforms. You can see some of the formats that depend on 16-bit here: https://en.wikipedia.org/wiki/Half-prec ... int_format

ImageMagick even releases special versions that support 16-bit per channel formats. I don't know if it's integer or FP, but I think it's the latter since they mention OpenEXR: http://imagemagick.org/script/download.php#windows