Friday, January 22, 2016

In a previous blog post I mentioned that I would eventually learn Rust, a programming language I thought might be a good replacement for C++. Well, after some time, I can now say that I am a Rustacean.

I have recently been wanting to work on a new automaton which requires particle physics, some glowy graph drawing, and machine learning algorithms. I am working presently to make these libraries in rust. My machine learning library was the first thing I wrote in Rust, and I intend to improve it significantly with what I've learned so far before publishing it. My particle physics library is at version 1.0.1 and I have published it on crates.io. I intend to add Barnes-Hut tree simulation to the library at some point, and a branch exists in the repository with code for Octrees, but I found that this was a complicated ordeal and have suspended this effort to continue working on other things.

Right now I am working on glowygraph, the library intended to use shaders to draw fancy glowing orbs and lines. This will be used to write two applications I plan to work on: a library for drawing and navigating 3d force-directed graphs and the automaton I am planning to work on.

Now, after having used Rust a fair amount, I feel that I can write this blog post to comment some of my thoughts about it. Firstly, the Rust community is amazing and filled with systems programmers and people that care about large project maintainability. People are constantly making good libraries and publishing them to crates.io. We have a plethora of libraries and applications written in Rust already, despite it being so new. The most important thing Rust has to offer is its powerful language saftey. Designers have strong powers to prevent users from incorrectly using the API of their libraries and modules. Rust also guarantees you won't cause a segfault so long as you don't use an "unsafe" block (which gives you the freedom of pointers and unsafe operations), or use any libraries with bugs in them. It is common to never get a segfault in rust even during development, and for userspace applications unsafe blocks are not necessary even for large projects. Libraries might have to use unsafe blocks to implement things that are safe, but Rust does not understand are safe. On top of this, Rust guarantees that you will have no race conditions, again so long as you avoid unsafe blocks. Rust is built around threading, but the standard library is a bit restrictive. I recommend the crate (library) crossbeam if one needs more powerful threading capabilities.

Another important point about Rust is how easy it is to use libraries. Rust has a build system called cargo which is quite simple but amazingly powerful. It connects with crates.io and makes publishing crates as easy as running a single command. Using crates is as simple as specifying a version in the Cargo.toml file for the project. It automatically fetches the most recent compatible version using semantic versioning from crates.io. This is amazing power at a Rust developer's disposal. On top of that, cargo supports tests (built and ran), examples (built and not ran), and benchmarks (built, ran, and timed), which are all handled when running 'cargo test'.

I could type for hours about the potential benefits of Rust as a replacement for C++, but just know that learning it is absolutely worth it. In the future I see Rust replacing C++ as the systems language of choice for writing new applications when all of the dependencies are available. If you are interested, watch this video to see what Rust is all about (it is a bit old, but the ideas are the same):

I will try to keep the blog updated again with projects as I work on them. College is my primary focus now, but my projects are second in line, and this is my best way to keep people updated in as much detail as they like.

Thursday, September 10, 2015

Recently I have been focusing entirely on UARC core0, my first UARC processor project. As a C++ programmer, I regularly like to take advantage of generic containers in the Standard Template Library (STL). When you are programming, there are interesting data structures to use, all of which could use generic template classes. Occasionally you have to implement your own, but most often the STL has all the basic structures you need. Generics have some other nice uses, like making templated virtual base classes so that you can use polymorphism on multiple objects that all share the same public interface, but might use slightly different types or other variables, so long as all of those objects inherit from the same version of the base class. However, C++ is not Verilog.

In Verilog, so many things that should be obviously supported are not. For instance, one cannot pass a multidimensional array of wires into a module, which is basically the "class" of Verilog. This is a problem if you want to pass an array of n-bit values into the module. If you do, then you have to transform it into one single array and then back inside the module—what a pain! Many of these issues are fixed in SystemVerilog, which introduces so many useful features, but is still not universally supported. Since Verilog is ubiquitous, I have to use it for my processor.

Verilog does have one feature that I think would make it completely useless without: parameters. It IS possible to make generic modules in Verilog that have variable inputs, outputs, or constant values. I used this in my VGA controller to allow for different VGA timings and modes. The timing is set through parameters. In core0, I must be able to have a variable amount of targets to communicate with connected through the inputs. This works fine for most purposes, but one particular thing annoyed me to no end today. I have to be able to handle interrupts from multiple targets inside the core itself, but I need to be able to get a number so I can look up the PC to set for a particular interrupt. In actual hardware this is not necessary because tristate can be used, so long as you can create a priority circuit that only allows one of the tristate buffers to be on for a given PC in the table, but this is also part of the process of getting the number as well using a priority encoder. In a normal encoder, you assume that only one input is chosen and output the number for the bit that was selected. However, a processor could be interrupted from multiple sources at once, so it is necessary to prioritize certain interrupts over others, since they cannot be handled simultaneously. The priority encoder has to do as said above and ensure only one input bit is visible to the actual encoder portion, depending on the priority scheme. For now, core0 uses a simple priority scheme: the highest bus ID always comes first. Eventually the priority might become programmable in a later version, but for now this is a sufficient scheme. I thought making a parameterized priority encoder would be trivial, but I was sure wrong! This is the diagram for a priority encoder. I ended up writing my Verilog a bit differently, using a string of OR and AND gates to compute the outputs and the output bit that specifies if any inputs are used at all. I hope that any smart synthesis software will be able to optimize this, because this was difficult enough as it is and it could be simplified. On top of all of this, I had to create an array of wires for every output OR gate so I could OR the variable amount of parameters together. I also had to compute which input bits corresponded to which output bit (check the code). You have to create enough wires to hold all of the intermediary AND and OR operations and use a generate loop. Yes, this is frustrating, but it works. I created a unit test for the encoder as well to make sure everything works properly. Who would have known it would take several hours to write 50 lines of Verilog?

Priority encoder unit test

Once the core is complete, I can go ahead with writing the terminal_test, which is a module that will simulate an environment with a terminal UARC core that communicates keystrokes across the bus and receives text to display on the terminal. It will also load a FORTH OS that I plan to write in assembly and add to the repository as well. Of course it will use my assembler to build the FORTH kernel. The reason I am using FORTH is to fully test the core in a programmable environment that I can write easily in assembly and possibly to play around with its capabilities before I write an LLVM backend for the core so I can compile programming languages for it.

Monday, September 7, 2015

It has been a long time since my last post. School was too busy in my previous semester to consider updating this blog, much less work on personal projects. This semester it should be different, so I will update more often. The project I was working on previously I abandoned. During my semester I came up with so many more projects that I wanted to work on. Here are the projects I have recently been working on in the order I started them (now I put everything on GitHub):

https://github.com/mathewv/minebos
This project is an attempt to write a proper ANSI FORTH OS with less bugs to supplant MineOS for the 65el02, the processor used in Eloraam's Minecraft mod: Redpower 2. My friend made the GitHub page, but he never contributed to his own project. Due to his lack of contribution and my interest in other projects, I haven't worked on this recently. This was when I first made my GitHub account; every project thereafter I have put on GitHub. I plan to pick this back up once Eloraam releases the specs for her new 65x02 derivative processor emulator that she will use in Redpower 3, a standalone version of Redpower that will be released on Steam.

After working on this, I learned Google Go (Golang). A large void in the time that I was working on projects on GitHub between June 24th to July 7th appeared because I actually created several repositories on GitHub using Golang, but then realized that Golang could not get the performance I needed for the projects I was working on, so I deleted them. I truly believe that Golang is fun to code in and I love its design, but it desperately needs generics for compile-time static checks on several datatypes such as linked lists. After using so much C++ in the past, I have acquired much respect for the RAII model, so I also detested Golang's garbage collection (GC) to an extent, since it actually could make code more verbose in some cases than when using RAII, in addition to requiring some manual cleanup for data types in which a memory leak could occur with GC. In the future I may learn Rust, but looking at some tests, it seems that Rust may actually take more code than C++, which means that its clever design might actually not help in terms of code conciseness and maintainability. I still plan to learn it eventually though, because I have read over its spec and I think that it does have some merit in addition to having every feature of C++ that I thought of while reading the spec.

https://github.com/vadixidav/gpi-cpp
This is the first project I hosted on my GitHub account, but it was originally in Golang before I took it down and rewrote it in C++ (update: it's been rewritten in Rust now; above is old version). This is a library I made to implement what I think is a very good algorithm for machine learning when applied specifically as a genetic program and used with artificial selection. It uses multi-expression programming (MEP), which is a design where a chromosome has a series of instructions that each can reference the outputs of previous instructions. I extended this notion to allow them to reference the outputs of all previous chromosomes in a whole program and also the inputs to the program. This model makes crossover very effective and simple or complicated algorithms can arise that reuse existing information. I also made it so that every "instruction" on the chromosome contains a choosing function and two possible sub instructions so that a less-than condition is checked before deciding which instruction to execute, allowing for branching paths and complicated logic to appear. Since the logic has a high level of complexity to create simple systems, it takes a long time to evolve simple programs using this genetic algorithm, however the capability of this genetic algorithm to generate complicated systems should be very high, and it should be able to solve those problems, but only when applied as a genetic program. This library cannot be used for feedback propagation training, only as an evolutionary/genetic algorithm. I will create a neural network library in the future that can be layered over input data, such as a convolutional network to create a combined program that can use the logic capabilities of MEP with the signal processing capabilities of convolutional networks. I am not sure if it would then be possible to train the convolutional network, since the desired outputs are unknown, but neural networks would be far more suitable for processing input signals.

https://github.com/vadixidav/phitron
This is a simple physics library I designed specifically because I often write the same particle physics routines over and over, so I decided to write this so it can all come from one place. Over time I may add new features to it, but for now it contains only basic interactions between particles, up to and including Lorentz forces between two moving particles. Every interaction in the library has either one version that can be used in parallel programs or two versions where one can be used in parallel applications and the other can be used in serial applications. For instance, one can gravitate an individual particle towards a point, or they can call a function that gravitates two particles to each other, which affects the state of both particles. In a parallel application, the state of both particles may need to be updated independently, so I made separate versions of these functions. I also wrote this for the next project I wrote.

https://github.com/vadixidav/evomata10
This project is the end result of Phitron and GPI: an automaton where cells are particles that can only reproduce by mating and spawn into clusters where they are connected to other cells of their species. Any cell can choose to connect with nearby cells and sever the connection with other cells as well. They can also send signals over these connections and attract/repel from each other at the cost of some food. Friction is applied to the cells and energy enters the system when cells appear, which can then be eaten by other cells. The crossover algorithm in GPI is applied to the cells so that they can breed. Unfortunately the simulation I ran here to test GPI went worse than I was hoping. It seems that grid-based simulation just result in more interesting results and are easier for cells to learn how to navigate. I was probably too optimistic in random genetic algorithms ability to succeed in navigating a 3d environment with particle physics. I also spent most of my time on this project working on the graphics. This is the first time I have worked with OpenGL 3.3, and I made sure to make a very proper object-oriented API to layer over OpenGL. The shaders look nice and I think that it turned out well. Unfortunately, the computer I developed this on does not have the greatest graphics memory bandwidth, which was the major determinant in performance for this application. I changed all the floats to half-floats, but still I only saw a speed up of 2x, which is consistent with being memory-bound. I never had to offload the GPI library to a GPU because I was always limited by the graphics. Technically I could have skipped several frames at a time, but then it doesn't look quite as nice.

https://github.com/Aissurteivos/mdrngzer
This project is a joint work between me and Aissurteivos. It is a randomizer for Pokemon Mystery Dungeon: Explorers of Sky. I mostly worked on it for fun and to help him out, since he is relatively new to C++ programming. He is now streaming a Twitch Plays of the game on his stream. It randomizes almost everything except the starting Pokemon due to a lack of Pokemon portraits; this includes items on maps (which was done using a very annoying system), Pokemon on maps, Pokemon abilities, Pokemon moves (from TMs, eggs, and learned), and several other things (check out rom.h to see).

https://github.com/vadixidav/nop (update: took it down)
This project doesn't even contain anything yet! I will get back to the project later, but I do have code written on my computer. The concept is that I will make a language that has only objects with constructors and destructors. It will be possible to "call" a function by creating a temporary instance of a function object, which allocates memory needed for the function and calls the constructor followed by the destructor. This is the simplest possible language I could imagine that would be useful, but it definitely isn't THAT useful and I will only pursue it as a very simple language for scripting or some other nonsense. The inclusion of anonymous objects as scopes would be a good way to make it more useful. Perhaps I will put more thought into this project later, but I presently have more important things to work on first.

https://github.com/uarc/emu
What is this UARC thing? It is a bit random, but I am now going to explain what this whole "UARC" thing is about. UARC is my idea for a Universal computer ARChitecture (UARC). I think that as time moves on, it is obvious that the Von Neumann Bottleneck is becoming a problem. When I refer to this bottleneck, I am referring to getting memory to where processing elements are so that it can be processed and then something can be done with it. For this problem I turn to computer clusters and grid computing. Right now the computation model has consistently been Symmetric Multiprocessing (SMP). This model entails that within a single processor multiple lines of execution of separate programs may be explored simultaneously, while presenting the same memory address space to all programs so that they can share data through main memory. Nothing breaks from this model aside from very small projects such as AsAP and Green Arrays' GA144 and the more popular cluster and grid computers. A cluster computer is designed so that computers can communicate over a high-speed bus to other computers, which is very important when you have a very parallel load running in separate memory spaces. Processor companies such as Intel and AMD thought they were catching up with the times by making SMP processor designs, but this is only a stop gap that can appease those people that require backwards compatibility with programs that are not properly parallelized and help programmers carry their existing skill-sets and tools over from the previous generation of computers. Unfortunately the activation energy has not been put in to push these companies to work on chips that cost billions of dollars to design and make, since even if you made a core that moved away from the SMP model, it likely wouldn't be used by many people.

Now it is 2015, and several things are lining up that I believe will begin the process of the downfall of SMP:

  • Many developers are choosing to run powerful algorithms on GPUs, which ALSO share memory, but unlike their CPU counterparts have many parallel cores that can operate on local caches and have access to a massive amount of memory bandwidth at a much higher latency. So long as you don't need to perform random accesses on a large memory space, a GPU is going to be very efficient at running your algorithm. Neural networks are especially good to run on GPUs, along with any sort of physics or rendering.
  • Cloud computing (and not just cloud storage, REAL COMPUTING) is starting to hit the consumer market. Cloud computing is all about getting efficient processing power and the architecture of a cloud system needs to be able to run a highly parallelized algorithm.
  • Existing SMP designs cannot scale further. It is clear that the complicated core design and large cache of the x86, intended to further the speeds of programs using their design, cannot continue to scale into more cores. If they did continue to scale, a memory bottleneck would be imminent, and the only way to solve the problem would be to add more caches, taking more die space, and making communication between cores even slower.
  • Companies need to seek ways to get more processing power on a die. Non-Uniform Memory Access (NUMA) is a way to get similar performance gains to what I am trying to achieve, but it still continues to enforce the ability of cores to access memory directly from other cores, which reduces the possibilities of what can be achieved by a decoupled system. NUMA is not wrong in any way, and it could be used to great effect on a single chip, but only alongside other systems.
So if NUMA is okay to use alongside "other systems," what exactly am I talking about? I want to bring the cluster computer to the chip. UARC is the bus and system I am designing to make multi-computer systems on a single chip using several different cores of entirely different and specialized design. By this I mean that graphics, physics, signal processing, communicating over the network, talking over buses, general purpose computing, and reading/writing to/from storage could all be handled by programmable or even non-programmable cores that could be implemented using another sub-cluster of cores, application-specific integrated circuits (ASICs), or even field-programmable gate arrays (FPGAs). The same bus and system can be extended beyond the core to create clusters of computers that combine cluster chips. Nothing is to say that if SMP is useful that it might tag along in this sort of system, but at some point to achieve more parallelism, it is necessary to start creating clusters, and this is where UARC will begin. In addition, I specify a system of privileges and groups that can cross the boundary of chips and ensure that programs can have access to as much resources as they are permitted on a cloud-computing system so that cloud-computing resources can be easily allocated by a parent operating system to whole clusters, chips, or even single cores. Given a proper cloud computing interface, it should be possible for programmers to dynamically ask for resources from such a cloud computing system to render movies, simulate physics, and sift through data while still having access to the underlying architecture and all of the inter-core communication advantages.

What other problems did I seek to solve? If you are going to have several programmable systems on a chip, reducing the program memory is key, since every core must have one (though a core could be devised where several processing elements share one program memory, and such a core is something I am interested in). Since these cores no longer have to be suitable for SMP, there is no need for task switching. Rather, all communication happens through interrupts (that also pass values) and streaming. Since task switching is gone, it is okay to increase the amount of state elements in the processor, so long as the size of caches and register files does not influence the size of the physical die space too greatly. In SMP, when a switch between threads occurs, the registers and state information needs to be preserved between threads, which incurs an overhead, this is gone in this situation. To reduce the program memory, I created a stack machine. I have also looked at belt machines, and I think they are interesting, but I am certain that for this application the stack machine is better.

Traditional stack machines have a lot of limitations that make them worse than register machines. However, I feel as though I have solved almost all of these issues. The first thing to note is that in any mathematical expression, the inputs may be loaded onto a stack in an order such that they can be processed in order as they appear on the stack. If data is retrieved in the correct order, it can be operated upon with the same amount of instructions as a register machine, except that the instructions may be as much as 4x smaller (in my case the instructions are 8 bits each). All mathematical operators produce one output, which means that they can be ordered correctly. However, in computers, there is one operation that is very frustrating: division. In computing, we have to be able to produce a remainder and a quotient from division, which means there are two outputs. Also, division takes MANY CYCLES to execute, which is also shared by many other common routines such as square root. To knock out two birds with one stone, I created a FIFO queue to store the remainder and quotient of each division, and allow them to be computed asynchronously in a pipeline. The quotient and remainder can be read out as necessary and where they are necessary on the stack. If only the quotient is read out, then if the next read from the FIFO is a quotient, it will automatically move to the next division operation's results. If two divisions need their results used in a strange order, the stack alone is not good enough to solve this problem, so there is a system in place to handle all sorts of other issues.

The dstack (data stack), which is the stack on which all operations take place in this architecture, is a special sort of stack. It has the flexibility of a register file and is not intended to be highly dense like other stacks found in this design. Out of the 256 instructions possible in my 8-bit instructions, 64 are dedicated to rotate and copy operations on the dstack. A copy copies an element to the top of the dstack. This is useful so that if two expressions reuse the same variable, a copy to the top of the stack is sufficient to use it for one expression, while also reusing it for another. The next operation is rotate. Rotate exists for the purpose of accumulators. If the programmer creates a loop, it is quite likely that they will be constantly modifying some variable, possibly adding up values or doing some other operation. This is a very important and common task, so I have included an operation to rotate (move) something from the middle of the stack and put it on top. So long as the compiler understands where all of these variables are during this process, it can continue to use this accumulator. This can also be used to re-order the stack, should the stack machine architecture fail to be fully-optimized by the compiler. The stack can be much deeper than the copy and rotation limit.

The tstack (temporary stack) is my next design. The name infers that temporaries will be stored to it, but it could be used for other purposes. Many times, a value will be computed and then be used many times in a loop, and for a function to clean up the stack when it is finished would introduce extra overhead for a stack machine. Rotation was not enough for this problem. I made a stack that is actually internally a register file. This "stack" is a randomly addressable register file that can be read and written to, but only relative to a particular pointer that moves forwards and backwards through its memory, which is how the stack is implemented. Temporary values can be pushed to the top of this stack, moving the pointer backwards. They can then be read in one instruction whenever needed in a loop or anywhere else. As soon as temporary values are no longer needed in a function, it can reuse the registers to write new temporary values, or values could be spilled from the dstack if they must. When a function returns and the dedicated return stack of the core is used, the tstack's pointer is returned to the place it was before the function call occurred. This means, like in a register machine, temporary values can be loaded into registers without having to clean them up, and since the state is automatically restored, this is somewhat better than a register machine in that sense. It is not intended, but possible, for functions to pass parameters on the tstack, but they should be passed on the dstack and then loaded into the tstack in the general case.

Now, one large issue with a stack system like this is that every time a program needs to process a stream in memory or do an indexed random read, it must rotate, copy, or read that value to the top of the stack. Also, stacks implemented in main memory are also a pain for the same reason. The solution to this is data counters (DCs). This core has four DCs. Each of them is an address that can be read/written from/to in one cycle. When this happens, the DC advances, but in either direction. The processor determines this direction by the instruction it uses to set the DC initially. The direction can be different for read and write. If the direction is different for read and write, then it automatically sets itself up such that when a stream of data is written, it will be read in reverse order, and the DC is a pre-decrement/increment pointer to a stack in memory. Indexed reads and writes can also be performed on a DC. Since a DC implies that a large chunk of memory that needs to be randomly read/written or streamed in is right in front of it, this will be a hint to the processor that it also must cache the memory in front of the DC without any cache hint instructions needed to be issued.

The DC is to be used to load all constants and some variables from a function's specific memory segment. This architecture does not store constants in program memory, instead making them stored staggered in a stream that is to be read by a DC. The architecture has four DCs so that one can be used in this way while the other three are used to process vector data, such as two read locations and one storage location. It can also be used to process on four memory locations simultaneously and then to restore the correct DC afterwards and cache all temporary and constant values on the tstack before executing.

As was hinted at earlier, a dedicated return stack is present, to avoid leaving data on the dstack that isnt passed and returned explicitly between functions. However, there is also a Zero-Overhead Loop Unit. This can be used to avoid even more instructions and shrink programs further, but it also speeds up the processor and effectively makes each core a vector processor, since it can use a hardware loop on a simple algorithm such as an accumulate operation, or adding up energy potentials between particles in a physics engine.

Finally, as it may have not been obvious, this is a Harvard architecture rather than a Von Neumann architecture. I wanted to avoid addressing at the byte level, and I wanted to maximize memory bandwidth usage through DCs and streams, so program memory needs to be separate. It is also important because if the program is only written once, then it is more die-space efficient to store the program in a ROM rather than a RAM such as SRAM or DRAM, which could take more space and possibly more energy to use.

The architecture I described is that of core0, the first official core that is part of the UARC specification. This doesn't mean anything except that it is a reference core for general purpose computing using UARC, and I think that it is the ideal sort of core for this system. I will not hesitate to make new versions and sacrifice backwards compatibility if the design would be better changed in some way. Unfortunately I do not have the money to physically implement and test this processor, but I am in the process of writing the Verilog implementation of the core for simulation on an FPGA: https://github.com/uarc/core0.

I also made an assembler for UARC that works with the emulator. It has a test program to demonstrate. This assembler is highly generic so that it takes config files to generate machine code for any host of architectures. The instruction convention should be consistent across architectures, but it may change over time.

Since I need to test my processor, I will be writing a FORTH OS to test all of its features and ensure the Verilog design is correct. This VGA controller is what I will be using to output to a display. Since I could find no open source controllers online that I thought were good, I decided to upload my implementation to save anyone else the trouble of having to design their own. You can use this with CRT, CVT, and reduced-blanking VGA implementations, but you are responsible for setting the horizontal and vertical sync polarities and the times. An example of how to use it is provided for the 1080p VESA standard for reduced blanking VGA display in the README.md.

In the future I will continue to finish the uarc-core0 design and then go on to making an OS for it to test it, which I will put on GitHub. Then I will make a few more cores using a similar design for different purposes and finally create a LLVM backend so applications can be ported to the architecture. Unfortunately, many applications that rely heavily on SMP, including the Linux kernel, likely cannot be ported, but this will be a good start. Ironically, single-threaded applications will be the easiest to port, but I will have to develop APIs to access the core inter-communication UARC provides.

Tuesday, January 27, 2015

I have recently been fiddling to try and fix problems I have encountered on this project which is now named Dynamaton. One issue I realized is that forces all take effect at the same magnitude at any particular distance. In reality, the electron shells of atoms repel each other strongly only in close proximity. To simulate this I included something that is very expensive: each force has a base "radius." Inside of this radius, particles do not interact using that particular force. Between that radius and the radius + 1, the force linearly increases to its maximum; then after radius + 1 it decreases with the inverse square law. This means that some forces will take effect in close proximity while some will take effect further away.

The biggest issue I had to deal with was making space toroidal. Since I implemented this by using the shorter radius, meaning that if on any particular axis the length was more than half of the space I would use the radius that wraps around, if two particles were precisely half of the width of the space in distance from each other they would toggle back and forth and achieve an equilibrium. This would always result in this pattern occurring in which every clump of particles spread into a cube matching the shape of the space like this.

I decided at this point that I would either leave space unbounded or bound it without making it toroidal. Unbounded space was not an option because then stuff would fly everywhere and it couldn't be seen. I tried bounding using a cube, which worked, but then everything went to the corners of the simulation. So now I am doing this.

I made it so that the velocity bounded off of the edge of the sphere where the hypothetical "new position" would be, not changing the position, when particles exceeded the bounds of the sphere.

Presently, this is single threaded and running on my CPU. I have plans to use OpenACC to easily make this run in parallel on my GPU, which should let me have many more particles than I presently do. Space segmentation optimizations will come later. The radius issue is particularly expensive because I must compute the square inverse of the adjusted radius for each force, which reduced the amount of particles I could spawn by a factor of 10, but it does make some significantly more interesting interactions, so I will continue to do this for now.

In addition, the meshes are set up. When particles enter a threshold and leave another, they mesh together and break apart. Before I had elasticity between meshed particles, but I might simply leave that functionality off. The last feature to implement is the reaction-diffusion system, so that will be coming up shortly. I also want to improve the graphics so that the particles look more like this shader toy shader and these spells from Skyrim (much more glowy). In addition to that, I will need to switch to the new shader pipeline, since I am sadly using OpenGL 1 features so far, since graphics performance isn't the limiting factor presently. OpenGL ES might be my target so that I can port this to Android fairly easily.

Thursday, January 22, 2015

I have been working on my recent project quite a bit, though I've recently returned to college, thus I am spending a little less time on it. Particle physics and neural neural network systems are complete, though the reaction-diffusion system is partially in-place, since the chemicals are represented and interact in the particle physics simulation, which doesn't seem to be too expensive.

Over the last few days I have been learning about Radial Basis Function Networks (RBFNs). In my previous posts I only discussed the perceptron, however there is an alternative, and arguably better, function to compute the output of a node in a neural network. The perceptron multiplies its inputs by a weight and then sums them, producing output by passing this sum through a function that could be a sigmoid or a threshold with a binary output. In either case, the perceptron effectively creates a line, with one side of the line being 0 and the other being 1.
A single perceptron categorizing the input

Sometimes this splitting of space is behavior you want, but sometimes it isn't. One other way to divide space into 1 and 0 is using an oval. Now, to be clear, this is only what happens with two dimensions of input, each of which are the x and y on a graph; this method can be used for as many dimensions as you like, and an oval may become an ellipsoid in 3 dimensions. Whether oval or ellipsoid, a RBF is designed to encapsulate some area of space, indicating that this region is what we would like to include or dis-include. It seems obvious that in many situations, several regions might need to be specifically added into the data set; this is what the RBF is capable of doing.

An image of 2d space segmented with a RBFN

RBFNs do something very simple: they use the Cartesian equation for any n-dimensional ellipsoid (though technically "ellipsoid" refers to the 3-dimensional variant). The "threshold" that was used by the perceptron is a similar concept to the "radius" of the RBF. As is apparent from the image, an area is surrounded by this function, which allows the neural network to classify regions. If one did not square the distances in each dimension and used an absolute value function instead, it would create a cuboid or rectangle instead. One could make an argument for such a system, but the RBFN seems to be very popular and effective in the field. To simplify its definition: it creates a rounded space around a point and classifies inputs in that region. It gets more complicated with a whole network, rather than a single function, but the RBFN is obviously powerful.

The reason I am bringing this up is because I have decided that a RBFN will power my simulation instead of a perceptron network. As time goes on, I may consider a hybrid between the RBF and perceptron, but for now I will be making the switch, since I consider RBFs to be superior in most situations for classifying information. It is likely that I will also try the "cuboid" approach as well, but I assume that the differences will be insignificant, so the switch would be purely for efficiency reasons.

The next thing to add to my project is the coloring of the spheres based on chemicals. This is trivial, but complicated since I am not sure that a "bound" should be created for the maximum brightness of any particular color based on the input chemicals. My plan is simply to sum the chemicals multiplied by coefficients for each color, then bring e to the power of the negative result. This will give me a value starting at 1 and approaching 0 as the chemicals increase. Then I will subtract this from 1 to receive the intensity of the color. I don't particularly like this solution, but it seems the be the best way to do it; I may even apply this to the inputs for the neural network.

Friday, January 16, 2015

Since my previous (and first) post I began working on the actual simulation for the cells, but then soon ran into an obvious issue: the brains of the cells. For cellular automata it is sufficient to make a simple state machine in which cells make decisions using a sort of tiny program, but because I am using a reaction-diffusion simulation, this doesn't seem to be an ideal solution, because the output needs to be between 0 and 1, as I will explain later. My first assumption is that neural networks would be the ideal solution, since they work with non-binary results, though upon looking into the issue, it seemed that the ideal neuron was a perceptron, a simple neuron that sums weighted inputs and then runs a function on this value. Perceptrons originally were designed to output a 1 or 0 depending on if they exceed a threshold value, but they were modified to use sigmoid functions so that they can be trained using back-propagation (specifically the one 1/(1+e^-a)). In my situation, natural selection will train the creatures, since no "correct" output exists, thus I cannot use back-propogation. Another consideration would be recurrent neural networks, which are extremely difficult to train using back-propogatioin, yet they are widely used, so it seems that they would be very helpful to my situation, but they actually are complicated to process, and since I want to have lots of creatures at once, I believe that this is not the best option.

One thing I realized in my research is that a lot of effort was developed on algorithms that can be trained easily using back-propagation. One type of system that completely ignores that prospect is using, which I used with most of my older evolutionary celluar automata, is a genetic program. I believe that this is certainly a good option, so much that I wrote a whole blog post on them before I wrote this one, but as I was concluding the post I realized that this might not be a good option. After that, I did even more research into alternative algorithms, but I came up with very little that I hadn't already tried myself or read about online. I was going to use a genetic program, so I started writing the code, thinking of how I might just do it in the process. I needed to allocate space for so many small nodes that if I dynamically allocated memory from the heap for every node, I would end up wasting most of it in the overhead of allocation. It eventually forced me to accept that I would have to have some fixed maximum amount of inputs to a node. The main reason this was a problem was because I was still trying to incorporate perceptrons into my model. In addition, I would absolutely need to process the input, which in my case is all of the chemical concentrations, in a special way, since it was a scalar value that could be almost anywhere, while the output needed to be somewhere between 0 and 1, which works perfectly for a sigmoid function. Ultimately, I read up on the amazing things that deep neural networks were capable of, and then I was partially convinced, but not entirely.

One thing I felt was not considered in deep neural networks was the significance of multiplexers in logic. A multiplexer (MUX, for short) is normally used in digital logic to select from inputs using selector bits, and they can literally do everything. I really felt that I had to add these into the mix, though I can't put my finger exactly on what problem they would solve, but I just noticed that perceptrons would have a hard time selecting between two completely independent decisions, since all they can do is combine multiple decisions together depending on their weights, rather than choose one over the other. So I conceived of a hybrid model.

My model has three layers to it, which I will name according to convention. It has a fairly large input layer, which is designed to create multiple separable regions in the input space that can be used in subsequent calculations. The outputs of this layer are binary, so as to be computationally efficient, though I may switch later to sigmoid functions just to see how much better they are, but my understanding is that the purpose of the curve is just to help make back-propogation easier. The next layer is the hidden layer, which in my case is actually a series of layers each containing one muxeptron, which is a concept I imagined. Each one of these hidden layers can reach back to receive input from any previous layer, which I feel is very important to keep the input relevant even near the end of the calculation, but also so calculations can be reused. Muxeptrons are an invention I created for the purpose of adding MUXs into the mix. Each muxeptron contains three perceptrons; one perceptron will either fire or not fire; depending on whether the first perceptron fires or does not fire, one of the other two perceptrons will be chosen to be used as the output. Finally, the last layer is the output layer. Because my outputs need to be scalar values between 0 and 1, I figured this was a very good time to use a sigmoid function. The output layer can only access previous layers and uses normal perceptrons that use a sigmoid function, which is very basic and works properly to my needs. Since not all nodes will be used in the computation, I chose to compute from the output backwards. Computation begins at the output and as inputs are required, they are calculated accordingly.

Another thing I decided was that if a node was unconnected from the network, I would leave it unconnected. This seems strange, but my reasoning is that this will simulate junk DNA to some extent. In addition, if a new node is added to the network through a mutation, I will simply add it to the network unconnected; a future mutation may cause it to become connected to the network, which allows the network to gain complexity without significantly altering its function.

The purpose of this brain is to control the diffusion coefficients for each cell connection in the mesh. In each simulation step, the cells will run the "brain" algorithm to produce diffusion coefficients, which are intended to simulate active transport, though only for outgoing diffusion (I may change this in the future so that cells can transport chemicals inside of them). As in real biology, this active transport will require energy, which is also going to be a chemical (namely chemical 0, otherwise known as food) in the reaction-diffusion engine. When a cell produces a value of 0, it will force there to be no diffusion, while a value of 1 will force all chemicals to diffuse out of the cell in one cycle. However, a cell that actually managed to produce a 1 or 0 exactly (which should be impossible with the sigmoid function) would kill itself immediately because the food cost would be infinitely high. A natural rate of diffusion is established that is closer to 0 than 1 at the initialization of the simulation, and if a cell matches this rate of diffusion exactly, it will not have a food cost, thus cells should adapt to "understand" this value, possibly generating it and selecting it with their MUXs to avoid food costs when food is low.

The only things a cell can read from its environment are just the amount of chemicals inside of its own cell (including food, which IS a chemical). This means that a cell will have to use chemicals to send signals through the mesh.

Although this is unimplemented as of now, I want to share my thoughts on how the chemical reactions and physics calculations will occur. When the simulation is initialized, I will generate a series of random chemicals, then I will generate an arbitrary amount of chemical reactions that allow a few elements to combine in a 1:1 ratio into another element, so that there is conservation of chemicals. After the chemicals are all generated and organized in this fashion so that chemicals cannot be created or destroyed through chemical reactions, I will define a catalyst matrix so that each chemical acts to some degree as a catalyst for every available chemical reaction. This catalyst matrix will be multiplied by the vector of chemicals in every node, so that reaction rate constant can be determined. The reaction rate will still depend on the concentrations of the input chemicals to the reaction. I will probably bring these concentrations to the power of their ratio in the reaction, but I have a feeling that this will slow down my simulation substantially, so I might come up with an alternate solution later on after profiling.

As for the physics simulation, every chemical will interact with every other chemical. Two base force vectors will be computed: one "gravitational" force vector that makes both particles attract or repel and a "magnetic" force vector that is computed as the cross product of the velocities of the particles. Both forces will then be individually multiplied by the summation of the products of each of the two chemical concentrations multiplied by a constant coefficient for that particular force. Finally, the force vectors will be summed and then divided by the inverse of the distance squared so that they follow the inverse square law and have conservation of energy (to some extent). Opposite forces will be applied to each of the two nodes as particles. Also, forces will decrease linearly starting at some radius that I will define so that particle interactions do not get too strong and fling particles all over the place at high speed.

My main concern with the physics simulation is that multiplying all the chemicals together and by their physics coefficients might be very computationally expensive for large amounts of chemicals, so I might impose a limited set of physics interactions just like I did with the chemical reactions, but I will only do this if profiling shows this to be a significant burden, which it may not since it is only doing many multiplications, which aren't too difficult to compute, while the magnitude of the distance requires a square root to compute, which IS a requirement, because I have to divide the vector three times by this value to get from actual magnitude to inverse magnitude squared, thus I cannot just use it in its squared form. As things need to change in the simulation, I will update them and make them work a bit faster. This project will absolutely be using OpenCL eventually, but I will begin writing everything in C++ just to see how things turn out. I will also eventually optimize this by using a barnes-hut simulation so that the particle interactions don't get to expensive at large values.

As I said before, I will keep posting my ideas on this blog regarding this project. I have a feeling that very interesting things will result from the simulation. If I like it enough, maybe I will turn it into a game. Once the code is more fleshed-out, I will consider making it open-source, in which case I am likely to release it under the BSD license.

Tuesday, January 13, 2015

This is the first post of the blog, so I suppose some introduction might be necessary. The intention of this blog is to showcase the interesting projects I am working on for my friends and others. I work on a variety of interesting things and since some people might be interested in them, this will become the primary location for information on my personal projects and my thoughts regarding how people ought to be doing things.

In the past I have worked on a lot of cellular automatons in which each cell contained a genetic code that was designed to be mutated, mated, and tested against a variety of other randomly generated and mutated cells. Not long ago I had planned to work on a simple game, which I may still return to, that incorporated particle physics with "chemicals." The intention was that randomized creatures composed of particles would divide and consume the finite particle resources around them to fuel their growth, permitting competition among multiple species of such creatures. I have thought about this for a while and the idea has evolved, but it did not significantly change. I had tested whether it was feasible or not, and it seems to be a bit too computationally expensive regardless of how it is done.

However, I recently learned about reaction-diffusion simulations, which can be used to simulate real world chemical reactions and biological processes. These amazing things are generic and only have local interactions, making them amazingly easy to simulate massively in parallel. Here are some remarkable examples of emergent phenomena created by reaction-diffusion simulations:







The method by which these work is that a reaction occurs inside each cell every time there is a new cycle. One or more different "chemical" materials are each present in the cell, and each of these reacts each each other only within the same cell. In addition, some of the chemicals leave the cell and enter adjacent cells, which simulates the process of diffusion. Each chemical has its own "diffusion coefficient," which is used to determine what proportion of each chemical spills over into adjacent cells every cycle. Although the reaction can be made complicated to compute, this process is overall easy to compute in parallel, which makes it a prime candidate for optimization via OpenCL.

After learning how it worked, my initial thought was that a square grid is not a good way to carry out these simulations, since a square has eight neighbors: four neighbors in the cardinal directions with distance 1, and four neighbors in the diagonal directions with distance sqrt(2). Real simulation of 2d space would be better achieved by using hexagons, each with 6 equidistant neighbors. For a 3d simulation, I would personally use a rhombic-dodecahedral honeycomb to achieve the same effect. However, the trend in general is that adjacent cells diffuse with each other...but they don't have to be on a grid.

My primary intention is to somehow create an automaton with evolving creatures that uses reaction-diffusion equations to control the flow of chemicals, which could be used as food or other material, so I started by thinking of a few ways I could do this. I thought at first that I might try to create a separate "chemical" for each creature, but then I realized that this would become too computationally complex, and the chemicals would not be able to respond intelligently to their environment. The next thing I considered was to create a system that would have many emergent-phenomena, such that a replicating creature could develop with a genetic code very easily, much like Langston's Loops. This may be possible, and I may try this idea later, but I think that is something cellular automatons will do better at, and though I may be wrong, I believe it would be difficult to create such a rule-set.

I have decided that one possible way to achieve an interesting developing system is to make each simulated cell in the reaction-diffusion simulation have its own genetic code. Each one of these cells could identify its own species and act to its own benefit by attempting to divide and control the rate of diffusion via active transport all at the cost of food (pat of the reaction). Multiple chemicals could interact with each other in ways such that food may be destroyed and interesting things can be done to other cells, so that cells have a way to defend themselves and attack their neighbors. However, this model is not sufficient on its own because too many cells would need to be simulated and cells cant effectively interact with other cells to their own benefit, since I suspect that some equilibrium might be achieved in a finite space. Thus I plan to make the space effectively dimensionless from the perspective of cells. Each cell will connect in a mesh to a variable number of other cells. A new cell can appear anywhere when a cell divides, which will connect the two cells and split old connections between the two cells so that no new connections are made.

Now I could use this model simply by inserting random cells into the mesh and having them fight it out, trying to destroy cells by shorting out their food. However, I am going a step further. The plan is to generate random cells and place them in their own separate mesh, then place each of these cells as a particle into a 3d n-body particle simulation. Connections between cells will pull them together and chemicals will cause various forces to be applied between the cells. Now cells can effectively become colony-meshes and attack other colony-meshes using inter-particle forces, attaching to those colonies, then performing biological warfare by pumping food-depleting chemicals into the opponent or giving them chemicals that force them to repel and detach from each other. The possibilities for different adaptations in such a system is endless, and colonies could form complex structures and even neural-networks naturally as part of the chemical-diffusion system that worked along the meshes.

Hopefully anybody that has read this far is as excited as I am. I used this blog to flesh out my ideas and my thinking, but also to share the project. As the project develops, I will upload pictures, videos, and new information. All the while I will be learning this blog business as I go.