Thursday, September 10, 2015

Recently I have been focusing entirely on UARC core0, my first UARC processor project. As a C++ programmer, I regularly like to take advantage of generic containers in the Standard Template Library (STL). When you are programming, there are interesting data structures to use, all of which could use generic template classes. Occasionally you have to implement your own, but most often the STL has all the basic structures you need. Generics have some other nice uses, like making templated virtual base classes so that you can use polymorphism on multiple objects that all share the same public interface, but might use slightly different types or other variables, so long as all of those objects inherit from the same version of the base class. However, C++ is not Verilog.

In Verilog, so many things that should be obviously supported are not. For instance, one cannot pass a multidimensional array of wires into a module, which is basically the "class" of Verilog. This is a problem if you want to pass an array of n-bit values into the module. If you do, then you have to transform it into one single array and then back inside the module—what a pain! Many of these issues are fixed in SystemVerilog, which introduces so many useful features, but is still not universally supported. Since Verilog is ubiquitous, I have to use it for my processor.

Verilog does have one feature that I think would make it completely useless without: parameters. It IS possible to make generic modules in Verilog that have variable inputs, outputs, or constant values. I used this in my VGA controller to allow for different VGA timings and modes. The timing is set through parameters. In core0, I must be able to have a variable amount of targets to communicate with connected through the inputs. This works fine for most purposes, but one particular thing annoyed me to no end today. I have to be able to handle interrupts from multiple targets inside the core itself, but I need to be able to get a number so I can look up the PC to set for a particular interrupt. In actual hardware this is not necessary because tristate can be used, so long as you can create a priority circuit that only allows one of the tristate buffers to be on for a given PC in the table, but this is also part of the process of getting the number as well using a priority encoder. In a normal encoder, you assume that only one input is chosen and output the number for the bit that was selected. However, a processor could be interrupted from multiple sources at once, so it is necessary to prioritize certain interrupts over others, since they cannot be handled simultaneously. The priority encoder has to do as said above and ensure only one input bit is visible to the actual encoder portion, depending on the priority scheme. For now, core0 uses a simple priority scheme: the highest bus ID always comes first. Eventually the priority might become programmable in a later version, but for now this is a sufficient scheme. I thought making a parameterized priority encoder would be trivial, but I was sure wrong! This is the diagram for a priority encoder. I ended up writing my Verilog a bit differently, using a string of OR and AND gates to compute the outputs and the output bit that specifies if any inputs are used at all. I hope that any smart synthesis software will be able to optimize this, because this was difficult enough as it is and it could be simplified. On top of all of this, I had to create an array of wires for every output OR gate so I could OR the variable amount of parameters together. I also had to compute which input bits corresponded to which output bit (check the code). You have to create enough wires to hold all of the intermediary AND and OR operations and use a generate loop. Yes, this is frustrating, but it works. I created a unit test for the encoder as well to make sure everything works properly. Who would have known it would take several hours to write 50 lines of Verilog?

Priority encoder unit test

Once the core is complete, I can go ahead with writing the terminal_test, which is a module that will simulate an environment with a terminal UARC core that communicates keystrokes across the bus and receives text to display on the terminal. It will also load a FORTH OS that I plan to write in assembly and add to the repository as well. Of course it will use my assembler to build the FORTH kernel. The reason I am using FORTH is to fully test the core in a programmable environment that I can write easily in assembly and possibly to play around with its capabilities before I write an LLVM backend for the core so I can compile programming languages for it.

Monday, September 7, 2015

It has been a long time since my last post. School was too busy in my previous semester to consider updating this blog, much less work on personal projects. This semester it should be different, so I will update more often. The project I was working on previously I abandoned. During my semester I came up with so many more projects that I wanted to work on. Here are the projects I have recently been working on in the order I started them (now I put everything on GitHub):

https://github.com/mathewv/minebos
This project is an attempt to write a proper ANSI FORTH OS with less bugs to supplant MineOS for the 65el02, the processor used in Eloraam's Minecraft mod: Redpower 2. My friend made the GitHub page, but he never contributed to his own project. Due to his lack of contribution and my interest in other projects, I haven't worked on this recently. This was when I first made my GitHub account; every project thereafter I have put on GitHub. I plan to pick this back up once Eloraam releases the specs for her new 65x02 derivative processor emulator that she will use in Redpower 3, a standalone version of Redpower that will be released on Steam.

After working on this, I learned Google Go (Golang). A large void in the time that I was working on projects on GitHub between June 24th to July 7th appeared because I actually created several repositories on GitHub using Golang, but then realized that Golang could not get the performance I needed for the projects I was working on, so I deleted them. I truly believe that Golang is fun to code in and I love its design, but it desperately needs generics for compile-time static checks on several datatypes such as linked lists. After using so much C++ in the past, I have acquired much respect for the RAII model, so I also detested Golang's garbage collection (GC) to an extent, since it actually could make code more verbose in some cases than when using RAII, in addition to requiring some manual cleanup for data types in which a memory leak could occur with GC. In the future I may learn Rust, but looking at some tests, it seems that Rust may actually take more code than C++, which means that its clever design might actually not help in terms of code conciseness and maintainability. I still plan to learn it eventually though, because I have read over its spec and I think that it does have some merit in addition to having every feature of C++ that I thought of while reading the spec.

https://github.com/vadixidav/gpi-cpp
This is the first project I hosted on my GitHub account, but it was originally in Golang before I took it down and rewrote it in C++ (update: it's been rewritten in Rust now; above is old version). This is a library I made to implement what I think is a very good algorithm for machine learning when applied specifically as a genetic program and used with artificial selection. It uses multi-expression programming (MEP), which is a design where a chromosome has a series of instructions that each can reference the outputs of previous instructions. I extended this notion to allow them to reference the outputs of all previous chromosomes in a whole program and also the inputs to the program. This model makes crossover very effective and simple or complicated algorithms can arise that reuse existing information. I also made it so that every "instruction" on the chromosome contains a choosing function and two possible sub instructions so that a less-than condition is checked before deciding which instruction to execute, allowing for branching paths and complicated logic to appear. Since the logic has a high level of complexity to create simple systems, it takes a long time to evolve simple programs using this genetic algorithm, however the capability of this genetic algorithm to generate complicated systems should be very high, and it should be able to solve those problems, but only when applied as a genetic program. This library cannot be used for feedback propagation training, only as an evolutionary/genetic algorithm. I will create a neural network library in the future that can be layered over input data, such as a convolutional network to create a combined program that can use the logic capabilities of MEP with the signal processing capabilities of convolutional networks. I am not sure if it would then be possible to train the convolutional network, since the desired outputs are unknown, but neural networks would be far more suitable for processing input signals.

https://github.com/vadixidav/phitron
This is a simple physics library I designed specifically because I often write the same particle physics routines over and over, so I decided to write this so it can all come from one place. Over time I may add new features to it, but for now it contains only basic interactions between particles, up to and including Lorentz forces between two moving particles. Every interaction in the library has either one version that can be used in parallel programs or two versions where one can be used in parallel applications and the other can be used in serial applications. For instance, one can gravitate an individual particle towards a point, or they can call a function that gravitates two particles to each other, which affects the state of both particles. In a parallel application, the state of both particles may need to be updated independently, so I made separate versions of these functions. I also wrote this for the next project I wrote.

https://github.com/vadixidav/evomata10
This project is the end result of Phitron and GPI: an automaton where cells are particles that can only reproduce by mating and spawn into clusters where they are connected to other cells of their species. Any cell can choose to connect with nearby cells and sever the connection with other cells as well. They can also send signals over these connections and attract/repel from each other at the cost of some food. Friction is applied to the cells and energy enters the system when cells appear, which can then be eaten by other cells. The crossover algorithm in GPI is applied to the cells so that they can breed. Unfortunately the simulation I ran here to test GPI went worse than I was hoping. It seems that grid-based simulation just result in more interesting results and are easier for cells to learn how to navigate. I was probably too optimistic in random genetic algorithms ability to succeed in navigating a 3d environment with particle physics. I also spent most of my time on this project working on the graphics. This is the first time I have worked with OpenGL 3.3, and I made sure to make a very proper object-oriented API to layer over OpenGL. The shaders look nice and I think that it turned out well. Unfortunately, the computer I developed this on does not have the greatest graphics memory bandwidth, which was the major determinant in performance for this application. I changed all the floats to half-floats, but still I only saw a speed up of 2x, which is consistent with being memory-bound. I never had to offload the GPI library to a GPU because I was always limited by the graphics. Technically I could have skipped several frames at a time, but then it doesn't look quite as nice.

https://github.com/Aissurteivos/mdrngzer
This project is a joint work between me and Aissurteivos. It is a randomizer for Pokemon Mystery Dungeon: Explorers of Sky. I mostly worked on it for fun and to help him out, since he is relatively new to C++ programming. He is now streaming a Twitch Plays of the game on his stream. It randomizes almost everything except the starting Pokemon due to a lack of Pokemon portraits; this includes items on maps (which was done using a very annoying system), Pokemon on maps, Pokemon abilities, Pokemon moves (from TMs, eggs, and learned), and several other things (check out rom.h to see).

https://github.com/vadixidav/nop (update: took it down)
This project doesn't even contain anything yet! I will get back to the project later, but I do have code written on my computer. The concept is that I will make a language that has only objects with constructors and destructors. It will be possible to "call" a function by creating a temporary instance of a function object, which allocates memory needed for the function and calls the constructor followed by the destructor. This is the simplest possible language I could imagine that would be useful, but it definitely isn't THAT useful and I will only pursue it as a very simple language for scripting or some other nonsense. The inclusion of anonymous objects as scopes would be a good way to make it more useful. Perhaps I will put more thought into this project later, but I presently have more important things to work on first.

https://github.com/uarc/emu
What is this UARC thing? It is a bit random, but I am now going to explain what this whole "UARC" thing is about. UARC is my idea for a Universal computer ARChitecture (UARC). I think that as time moves on, it is obvious that the Von Neumann Bottleneck is becoming a problem. When I refer to this bottleneck, I am referring to getting memory to where processing elements are so that it can be processed and then something can be done with it. For this problem I turn to computer clusters and grid computing. Right now the computation model has consistently been Symmetric Multiprocessing (SMP). This model entails that within a single processor multiple lines of execution of separate programs may be explored simultaneously, while presenting the same memory address space to all programs so that they can share data through main memory. Nothing breaks from this model aside from very small projects such as AsAP and Green Arrays' GA144 and the more popular cluster and grid computers. A cluster computer is designed so that computers can communicate over a high-speed bus to other computers, which is very important when you have a very parallel load running in separate memory spaces. Processor companies such as Intel and AMD thought they were catching up with the times by making SMP processor designs, but this is only a stop gap that can appease those people that require backwards compatibility with programs that are not properly parallelized and help programmers carry their existing skill-sets and tools over from the previous generation of computers. Unfortunately the activation energy has not been put in to push these companies to work on chips that cost billions of dollars to design and make, since even if you made a core that moved away from the SMP model, it likely wouldn't be used by many people.

Now it is 2015, and several things are lining up that I believe will begin the process of the downfall of SMP:

  • Many developers are choosing to run powerful algorithms on GPUs, which ALSO share memory, but unlike their CPU counterparts have many parallel cores that can operate on local caches and have access to a massive amount of memory bandwidth at a much higher latency. So long as you don't need to perform random accesses on a large memory space, a GPU is going to be very efficient at running your algorithm. Neural networks are especially good to run on GPUs, along with any sort of physics or rendering.
  • Cloud computing (and not just cloud storage, REAL COMPUTING) is starting to hit the consumer market. Cloud computing is all about getting efficient processing power and the architecture of a cloud system needs to be able to run a highly parallelized algorithm.
  • Existing SMP designs cannot scale further. It is clear that the complicated core design and large cache of the x86, intended to further the speeds of programs using their design, cannot continue to scale into more cores. If they did continue to scale, a memory bottleneck would be imminent, and the only way to solve the problem would be to add more caches, taking more die space, and making communication between cores even slower.
  • Companies need to seek ways to get more processing power on a die. Non-Uniform Memory Access (NUMA) is a way to get similar performance gains to what I am trying to achieve, but it still continues to enforce the ability of cores to access memory directly from other cores, which reduces the possibilities of what can be achieved by a decoupled system. NUMA is not wrong in any way, and it could be used to great effect on a single chip, but only alongside other systems.
So if NUMA is okay to use alongside "other systems," what exactly am I talking about? I want to bring the cluster computer to the chip. UARC is the bus and system I am designing to make multi-computer systems on a single chip using several different cores of entirely different and specialized design. By this I mean that graphics, physics, signal processing, communicating over the network, talking over buses, general purpose computing, and reading/writing to/from storage could all be handled by programmable or even non-programmable cores that could be implemented using another sub-cluster of cores, application-specific integrated circuits (ASICs), or even field-programmable gate arrays (FPGAs). The same bus and system can be extended beyond the core to create clusters of computers that combine cluster chips. Nothing is to say that if SMP is useful that it might tag along in this sort of system, but at some point to achieve more parallelism, it is necessary to start creating clusters, and this is where UARC will begin. In addition, I specify a system of privileges and groups that can cross the boundary of chips and ensure that programs can have access to as much resources as they are permitted on a cloud-computing system so that cloud-computing resources can be easily allocated by a parent operating system to whole clusters, chips, or even single cores. Given a proper cloud computing interface, it should be possible for programmers to dynamically ask for resources from such a cloud computing system to render movies, simulate physics, and sift through data while still having access to the underlying architecture and all of the inter-core communication advantages.

What other problems did I seek to solve? If you are going to have several programmable systems on a chip, reducing the program memory is key, since every core must have one (though a core could be devised where several processing elements share one program memory, and such a core is something I am interested in). Since these cores no longer have to be suitable for SMP, there is no need for task switching. Rather, all communication happens through interrupts (that also pass values) and streaming. Since task switching is gone, it is okay to increase the amount of state elements in the processor, so long as the size of caches and register files does not influence the size of the physical die space too greatly. In SMP, when a switch between threads occurs, the registers and state information needs to be preserved between threads, which incurs an overhead, this is gone in this situation. To reduce the program memory, I created a stack machine. I have also looked at belt machines, and I think they are interesting, but I am certain that for this application the stack machine is better.

Traditional stack machines have a lot of limitations that make them worse than register machines. However, I feel as though I have solved almost all of these issues. The first thing to note is that in any mathematical expression, the inputs may be loaded onto a stack in an order such that they can be processed in order as they appear on the stack. If data is retrieved in the correct order, it can be operated upon with the same amount of instructions as a register machine, except that the instructions may be as much as 4x smaller (in my case the instructions are 8 bits each). All mathematical operators produce one output, which means that they can be ordered correctly. However, in computers, there is one operation that is very frustrating: division. In computing, we have to be able to produce a remainder and a quotient from division, which means there are two outputs. Also, division takes MANY CYCLES to execute, which is also shared by many other common routines such as square root. To knock out two birds with one stone, I created a FIFO queue to store the remainder and quotient of each division, and allow them to be computed asynchronously in a pipeline. The quotient and remainder can be read out as necessary and where they are necessary on the stack. If only the quotient is read out, then if the next read from the FIFO is a quotient, it will automatically move to the next division operation's results. If two divisions need their results used in a strange order, the stack alone is not good enough to solve this problem, so there is a system in place to handle all sorts of other issues.

The dstack (data stack), which is the stack on which all operations take place in this architecture, is a special sort of stack. It has the flexibility of a register file and is not intended to be highly dense like other stacks found in this design. Out of the 256 instructions possible in my 8-bit instructions, 64 are dedicated to rotate and copy operations on the dstack. A copy copies an element to the top of the dstack. This is useful so that if two expressions reuse the same variable, a copy to the top of the stack is sufficient to use it for one expression, while also reusing it for another. The next operation is rotate. Rotate exists for the purpose of accumulators. If the programmer creates a loop, it is quite likely that they will be constantly modifying some variable, possibly adding up values or doing some other operation. This is a very important and common task, so I have included an operation to rotate (move) something from the middle of the stack and put it on top. So long as the compiler understands where all of these variables are during this process, it can continue to use this accumulator. This can also be used to re-order the stack, should the stack machine architecture fail to be fully-optimized by the compiler. The stack can be much deeper than the copy and rotation limit.

The tstack (temporary stack) is my next design. The name infers that temporaries will be stored to it, but it could be used for other purposes. Many times, a value will be computed and then be used many times in a loop, and for a function to clean up the stack when it is finished would introduce extra overhead for a stack machine. Rotation was not enough for this problem. I made a stack that is actually internally a register file. This "stack" is a randomly addressable register file that can be read and written to, but only relative to a particular pointer that moves forwards and backwards through its memory, which is how the stack is implemented. Temporary values can be pushed to the top of this stack, moving the pointer backwards. They can then be read in one instruction whenever needed in a loop or anywhere else. As soon as temporary values are no longer needed in a function, it can reuse the registers to write new temporary values, or values could be spilled from the dstack if they must. When a function returns and the dedicated return stack of the core is used, the tstack's pointer is returned to the place it was before the function call occurred. This means, like in a register machine, temporary values can be loaded into registers without having to clean them up, and since the state is automatically restored, this is somewhat better than a register machine in that sense. It is not intended, but possible, for functions to pass parameters on the tstack, but they should be passed on the dstack and then loaded into the tstack in the general case.

Now, one large issue with a stack system like this is that every time a program needs to process a stream in memory or do an indexed random read, it must rotate, copy, or read that value to the top of the stack. Also, stacks implemented in main memory are also a pain for the same reason. The solution to this is data counters (DCs). This core has four DCs. Each of them is an address that can be read/written from/to in one cycle. When this happens, the DC advances, but in either direction. The processor determines this direction by the instruction it uses to set the DC initially. The direction can be different for read and write. If the direction is different for read and write, then it automatically sets itself up such that when a stream of data is written, it will be read in reverse order, and the DC is a pre-decrement/increment pointer to a stack in memory. Indexed reads and writes can also be performed on a DC. Since a DC implies that a large chunk of memory that needs to be randomly read/written or streamed in is right in front of it, this will be a hint to the processor that it also must cache the memory in front of the DC without any cache hint instructions needed to be issued.

The DC is to be used to load all constants and some variables from a function's specific memory segment. This architecture does not store constants in program memory, instead making them stored staggered in a stream that is to be read by a DC. The architecture has four DCs so that one can be used in this way while the other three are used to process vector data, such as two read locations and one storage location. It can also be used to process on four memory locations simultaneously and then to restore the correct DC afterwards and cache all temporary and constant values on the tstack before executing.

As was hinted at earlier, a dedicated return stack is present, to avoid leaving data on the dstack that isnt passed and returned explicitly between functions. However, there is also a Zero-Overhead Loop Unit. This can be used to avoid even more instructions and shrink programs further, but it also speeds up the processor and effectively makes each core a vector processor, since it can use a hardware loop on a simple algorithm such as an accumulate operation, or adding up energy potentials between particles in a physics engine.

Finally, as it may have not been obvious, this is a Harvard architecture rather than a Von Neumann architecture. I wanted to avoid addressing at the byte level, and I wanted to maximize memory bandwidth usage through DCs and streams, so program memory needs to be separate. It is also important because if the program is only written once, then it is more die-space efficient to store the program in a ROM rather than a RAM such as SRAM or DRAM, which could take more space and possibly more energy to use.

The architecture I described is that of core0, the first official core that is part of the UARC specification. This doesn't mean anything except that it is a reference core for general purpose computing using UARC, and I think that it is the ideal sort of core for this system. I will not hesitate to make new versions and sacrifice backwards compatibility if the design would be better changed in some way. Unfortunately I do not have the money to physically implement and test this processor, but I am in the process of writing the Verilog implementation of the core for simulation on an FPGA: https://github.com/uarc/core0.

I also made an assembler for UARC that works with the emulator. It has a test program to demonstrate. This assembler is highly generic so that it takes config files to generate machine code for any host of architectures. The instruction convention should be consistent across architectures, but it may change over time.

Since I need to test my processor, I will be writing a FORTH OS to test all of its features and ensure the Verilog design is correct. This VGA controller is what I will be using to output to a display. Since I could find no open source controllers online that I thought were good, I decided to upload my implementation to save anyone else the trouble of having to design their own. You can use this with CRT, CVT, and reduced-blanking VGA implementations, but you are responsible for setting the horizontal and vertical sync polarities and the times. An example of how to use it is provided for the 1080p VESA standard for reduced blanking VGA display in the README.md.

In the future I will continue to finish the uarc-core0 design and then go on to making an OS for it to test it, which I will put on GitHub. Then I will make a few more cores using a similar design for different purposes and finally create a LLVM backend so applications can be ported to the architecture. Unfortunately, many applications that rely heavily on SMP, including the Linux kernel, likely cannot be ported, but this will be a good start. Ironically, single-threaded applications will be the easiest to port, but I will have to develop APIs to access the core inter-communication UARC provides.