Design Considerations for BlueIllusion
back
I've of course done some thinking ere I have started to do the actual coding. First, I 've had to wrap my mind around the idea, that the computers memory is there and that the kernel and the processes are sitting in memory, at different places. The mmu then does the level of abstraction required by the os developer for achieving this: abstracting the process adress space to a continuous area from 0x00000000 to 4 GB. This can be done with paging - lovely, nice paging. It is a hell to grasp, but dead easy once one 's wrapped his brains around it. Then, there is the very idea of having a bunch of processes spinning around the cpu, managed by the kernel, which is basically nothing more but a message router - in a micro kernel. Of course, I have it do some more things: managing the job control blocks. There is one task in BlueIllusion, which talks directly to the job control management: the system/buffy task. It is responsible for all the low level grunt work like doing a fork (the basic work - duplicating the stack -> in order to copy the hardware state of the forked process), or copying data between two processes. System/buffy is also responsible for slaying processes upon exit().
Following, I present several things I've discovered whilst doing it.
Message passing
If considering writing a micro kernel, one should layout how he'll manage the message passing for there are several approaches to do this:
1. Process A delivers the message directly to process B: This is a strongly synchronized approach. If process B doesn't receive at the moment of sending, Process A is suspended and queued upon Process B's 'sending' queue, until Process B is receiving. Using this method, you don't need to care for message queues - but you enqueue processes.
2. Process A sends the message to Process B. The message is copied to a dedicated region in kernel space and attached to Process B's message queue (lets call it the Post box). The next time, Process B looks into its postbox, it will retrieve the message. You can have processes block for receiving a message or just stroll by and check whether there is something to fetch. One could call this method "store and forward message passing" - one could also call it message routing.
It isn't a bad idea to wrap some brains around the following problem: What, if Process A expects a message from Process C. It needs a possibility to communicate this to the message routing subsystem. If Process B sends a message to Process A, Process A shall not be woken up for it wants something from Process C. Process C sends a message to Process A. Then and only then shall the message routing subsystem wake up Process A.
What primitives are required to get message passing running? Two: send(message), receive(message).
Memory Management
In a microkernel environment, you have basically the task management (not process management!), the interrupts management and task scheduling, the message routing and some little needful things (no no evil ones), around which driver tasks, service processes and user processes are spinning.
Well, and there comes up a question: where the hell shall I put the memory management? In sense of heap management: give the kernel a dedicated allocator and a dedicated memory area to use - you might need two of them: one for the messages, and one for all the other stuff.
The other task of memory management: Process adress space management, keeping track of used pages (yes, lets talk about aging, it is nice, it is neat, it causes a warm fuzzy feeling beneath the toes) and assigning memory to processes as needed, you can split it up. To be more precise, you have to split this task up - or keep every aspect of memory management in kernel land to make it easy. A recommended method for managing process adress space: handle it in a per-process-manner. The actual process needs memory: allocate the memory and hand it out to the process. So you can keep the page allocating routines easy and straight forward. For this task of allocating/deallocating memory, you sould take into consideration, that the task performing such actions should be able to slip into adress spaces at needs (it loads the required pagedirectory and does what it has to do - slimy little weasel thou' it is.) Take those things into good consideration and put quite an amount of brainwork into them. It's worth doing good engineering here.
Thoughts about implementing reverse lookup lists for page book keeping in mmdrv
Given the following structures, I will implement a means to keep track of dealt-out pages and references to them:
//This here is the element which does the most crucial thing: keeping the physical
//adress of the page at hands and also keeping the refering processes in a list.
//If a page is referred to by any processes, it canna be pt back to the pool of free
//pages. This list of pages is kept in the element pagelist_t. I should take into
//consideration whether it wouldn't be a bad thing to organize this in a double
//linked list.
typedef struct page{
adresse_t phys_page;
union u{
pageowner_list_t *pol;
prozess_t *owner;
};
struct page next;
}page_t;
typedef struct pagelist{
page_t *head;
page_t *tail;
}pagelist_t;
//these are the elements needed to do the bookkeeping of processes referring to the
//page.
typedef struct pageowner_list{
pageowner_t *head;
pageowner_t *tail;
}pageowner_list_t;
typedef struct pageowner{
prozess_t *pr;
struct pageowner *next;
}pageowner_t;
This approach of course needs to be carefully evaluated, especially concerning need of memory. The more, I have to take care of the order correctnes of the algorithm for the pushing/popping of pages from/to the stack of free pages.
File Management
As we already know, a micro kernel consists of several processes which are responsible for all the grunt work: allocating memory, managing processes - and housekeeping permanent data storage like hard disks or floppies. Yes, about filesystems the talk is.
Say, we have a file system service process. It is responsible for keeping the File System on a Floppy/HD what so ever up to date (and eventually handle pipes and so forth...) The very basic -- the uttermost basic task of the file system service is to keep track of allocated/deallocated blocks and (in case of ext2) inodes - files, to which blocks are allocated - you also need blocks for the file/block management. an other task, a file system service usually performs, is keeping blocks in memory for quick acess. They are slow to retrieve from disk storage - compared to a simple memory access. So the file system service keeps the so called block cache. Now, think about it: you use paging, you have the file system service as a process of its own. It fetches blocks and keeps them in its adress space.
One has quite some work to do in order to transfer the data of a retrieved block to a user process which performs the read() call - or vice versa a write().
Now, lets think further: What, if we have several processes requesting file system services concurrently? In a single process file system server - as it is in current BlueIllusion release - you need to queue up loots of messages and the process is not always ready to receive them - quite some sorting of messages is needed to achieve what we want to have. So, why not introducing a worker thread and a task-dispatcher thread in the file system for the beginning. The worker thread just fetches the next task from the internal queue. The task-dispatcher thread picks messages from the post box, performs some privilege and ipc rights tests and schedules the request at the IO queue for execution. The IO queue is of course protected by a semaphor. *gg*
Of course, this scheme of scheduling requests can be done in the drivers too - But at the moment, I am happy with the IO scheduling in the file system service - as soon as I have implemented threads.
block cache:
Currently, there is only one cache system wide.
This is to be considered a really bad thing for the File service sports a thread for each mounted file system (which is responsible for the read/write requests).
by using one global cache, there are crossings to be likely: you order a block from cdrom and it finds a block which originates from floppy f. ex. that's not a nice thing.
So I daresay it is ok to have block cache structures per device and to flush this cache to disk (if it's writeable) upon unmount or sync.
IPC - Semaphores
In a micro kernel, interprocess communication is a NEED. How else should the services and the device drivers communicate with each other. But although the message passing system in BlueIllusion is quite performant, it is not meant to be used just for synchronisation between processes/threads, for messages have to be stored in message queues, space has to be allocated for them, and the more, they have to be copied. Remember: Store-and-forward message passing.
As this kind of IPC causes quite some overhead, one should consider to provide other means of IPC for simpler tasks too: Mutexes or Semaphores, and introduce an additional software interrupt for those to avoid Mixing up with the message passing kernel trap.
Semaphores are a kind of "global" available integer variables. Each Process which knows the key for a semaphor can access it: Probe/leave.
Upon Probing a semaphor for availability, the semaphor value is tested for being greater than zero. If this is true, the value of it is decremented by one or the value the process submits as an indicator of how many of the ressources the semaphor guards it will use. this will only work, if the semaphor value is greater or equal than the value it has to be decremented with. If the semaphor value is equal zero, the ressource it guards is not available. the Process has to wait or continue other work (when just polling the ressource) As this needs to be an atomic operation, Probe enqueues the process on the semaphors pending_ops queue and leaves.
Another process, which owns the guarded ressource at the moment, calls Leave to tell the kernel, that it leaves the critical region. Leave increments the value of the semaphor by the value the process submits and checks the pending_ops queue for dormant processes waiting for the ressource. The next one is woken up, its operation is executed and it enters the critical region. This way it continues as long as there are processes in the semaphors pending_ops queue.
Semaphores in BlueIllusion are Sets of Semaphor variables. they are Initialized upon startup if needed - so that the creation and initialization of a semaphor set is an atomic operation.
Sharing of IRQ-Lines
Imagine the following: You have one IRQ-Line, which is say shared by 4 PCI Devices. Now, the ISR Stub is called and want's to call the proper ISR, but HALT, which one? WHich device has fired the IRQ? who is responsible for say the Ethernet device on IRQ-line 9?
I can imagine having the following management structure for isr's:
#define MAX_IRQS 16
typedef struct isr_handler{
uint_t id; //osinternal handle
int (* irq_handler_t)(int irq);
struct isr_handler next;
}isr_handler_t;
typedef struct isr_list{
isr_handler_t *head;
isr_handler_t *tail;
}isr_list_t;
//and this is the array of IRQ-handlers:
isr_list_t irqhandler[MAX_IRQS];
//this array is initialized with NULL
A global handler takes the IRQ number, goes into this array and calls the first registered ISR,if one is present. If the ISR returns 0, everything went fine, else it is time to either call the next in chain or to call the default ISR - just dump a message to log or screen.
After all, this should be a fine and perfectly possible thing to do. One would actually need to ask a device on the line, if it has triggered the IRQ. Then, go ahead with treating it.
GUI-Service and direct exposing of framebuffer memory to an application
What, if we are about to do animations or playing back videos - and keep two or more animated windows open at a time - with all the rendering stuff?
First, I can easily imagine having the video driver register a shared memory entity for the frame buffer. The GUI-Service does the acquiring for the requesting application via an internal protocol call - say WIN_GETFRAMEBUFFER - and the GUI returns the name of the shared memory area alongside with a list of clips for the window in question - as well as a client area.
What about having it be a control of its own? Drawing operations on this control are performed directly on the framebuffer itself (only if not obscured of course). Or other: what about having the gui-service expose its drawing buffer to applications and have them perform those drawing operations directly onto this buffer?

I have given it, as you can see, a thought:
I will still keep the server side managed list of windows and clipping will be performed on the serverside too. I will also keep one offscreen drawing buffer, where all the drawing is performed a priori. If an area has changed, it is blitted to screen by the video driver (which will eventually perform the hardware accelerated function). The client side Library (an additional part of uiLib) will keep a window list too - and a list of clips per window, as well as some "client area", which cannot be overdrawn. It maps in the offscreen drawing buffer and draws to it. For each pixel, it will iterate through the list of clips (Lets check this one out, it it is performant enough.), say like this:
typedef clip{
int x1;
int y1;
int x2;
int y2;
}clip_t;
//upon getting the list of clips, we say: empty_list(&win->cliplist);
//upon receipt of each clip, we simply say: malloc(clip),
//then insert_element(&win->cliplist,clip);
//then there comes the drawing:
int set_pixel(win,x,y,color_t *c){
clip_t *lcl=NULL;
uint_t offset;
//we bail out if the pixel exceeds valid borders... here also screen borders
//should be checked!
if(xclient_x&&x>win->client_x2&&yclient_y&&y>win->client_y2) return -1;
set_iterator(&win->cliplist); //start iterating:
do{
lcl=get_iterator(&win->cliplist);
if(lcl&&x>=lcl->x1&&x<=lcl->x2&&y>=lcl->y1&&y<=lcl->y2){
offset=(y*xres+x)*3(4) //if in 32 bit mode or 24 bit mode
screen[offset]=c->r;
screen[offset+1]=c->g;
screen[offset+2]=c->b;
break;
}
}while(lcl);
return 0;
}
Events will be sent directly to the client in form of raw MOUSE_DOWN, MOUSE_UP stuff, which the client side looper has to translate. It just should scan the list of objects as usual, but additionally, there will be drawing stuff added to the registered objects, which the looper has to activate - drawing stuff to perform in default eventhandlers which are to be called prior to custom eventhandlers.
Here is the current state of BlueIllusion GUI:

As you can see, the application just orders windows, widgets and the server draws all the ordered stuff. Given this, we have client side management code to distinguish the different sources of events, and we have server management code to place the events, draw widget feedback accordingly and to perform clipping and exposing all the drawn stuff to screen finally. I 'm eager to keep the expose and redraw stuff as limited as possible and to regions as small as possible to provide snappy and quick drawing and user feedback.
... after having done some development and coding stuff, I've come to the conclusion, that moving the full window is feasible with the new approach too: we need 1. to change the real time policy to *preempt current*, 2. to map framebuffer memory into the address space of gui service, 3. to spark a thread from gui service, which is solely responsible for handling window drag and window resizing stuff -> alongside with a dedicated expose_area function and a set of functions like those in video driver which were responsible for the window moving stuff - displacement rectangle & co. That is: do clipping, redraw affected regions, expose_area: new x1,y1, old x2,y2.
This dedicated GUI_windowhandler thread 'd be of real time priority and spin around the cpu in cooperation with the video_gui_interface, which does the mouse drawing/moving stuff.
See the picture below for kinda layout of how I imagine this:

Networking
As development continues, I'm going to sketch several aspects of networking here.
Ip Fragmentation

Above sketched is an example of how an IP-Fragmnt list can be buildt. The first ip datagram which has the "more Fragments" flag set, triggers the creation of an ip fragment list for this ip packet with ip src, ip dest, ip id. The ip-header of the fragment is replaced by a stub, which holds pointers to following or preceding packets (I'm going to keep a single linked list - search and insert in one pass. deletion is in one direction either), and is inserted as first element to the list.
Then, successive packets whichhave the "more fragments" flag set are inserted as follows: each frag is inserted before or after: le->fragofffragoff+frag->fragsize(before) or le->fragoff+le->fragsize+1<=frag->fragoff(after or adjacent).
if all fragments are collected (sum of all frag sizes == ip packet size or some flag not set - we do have to look into the frag lists?), the packet is deprived of its ip header, the frags are joined (data portions only) into a single buffer list and ip calls tcp_input or udp_input or icmp_input. Upper layer after all. If fragmentlist times out it is dropped and no notification of any upper layer happens. Partner will need to resend the whole stuff.
Networking considerations in general:
Tips for vmware network testing:
edit->virtual network settings to configure your vmware devices.
vmware also offers DHCP & NAT. you can configure it completely to your liking.
For starters, I recommend you test with host-only or nat networking for the bridged device
is way too busy to carry out useable tests.
In windows-vmware it is possible to ping the NIC from windows side and also from the vm-hosted os.
So, the interface with address 192.168.0.1 (f. ex.) can be pinged from windows (host os) and from your own
OS (guest os). You need to take care in the virtual network settings, that netmask & emulated network on that
interface correspond to the network you configure in your guest os: if you configure your guest interface to
network 192.168.1.0 and the vmware-interface is configured to 192.168.0.0 then it will not respond to pings
and other packets from the guest os. I have yet to try this in Linux.
Also you need to take care of your packet checksums. If a checksum is invalid, the respective packet is dropped.
That's about it. YOu won't ever notice. You'd just wonder: why does this refuse to work. So, check your
checksum algorithms. Check if you apply HTONS & HTONL where it is due.
Networking in general:
Take a breath. Relax. Ok: Networking Stacks are like Kernels. Event Driven. They rely on a Kernel, true. But they
are as sophisticated. YOu need to incorporate request deferring mechanisms to queue Client requests, IP packets
before the ARP Protocol is done with its stuff, Client request for a ping ... and many more.
Furthermore, the drivers: You see, there is the Output thread of the nic (I tend to have this as a thread of its
own) and there is a thread which takes care of the interrupts which the nic triggers. Now, hold on. You say:
why not dealing with that stuff in an ISR? NO. It'd take too long to pry a packet out of the NIC ringbuffer
memory and pass it throu' the protocol stack. Have the ISR notify the thread and have the thread treat to the nic
appropriately! That's my proposal.
Layout of a network stack:
In general: Output functions connect the output side of the protocols to each other. Input functions connect the
input sides to each other. So you have f. ex. socket_write call-> tcp_output->ip_output->if_output (which occasionally calls arp_output)
and on the other hand you have f. ex. if_input (fetch a packet, check ethernet header)->ip_input (d the occasional
defraggling)-> tcp_input (do the got_data & ack & etc stuff)/udp_input -> satisfy a socket_read call.
Inside the network stack you should consider how to treat/create that packet stuff: I've chosen to implement a
buffer mechanism which deals out a buffer of a fixed size large enough to hold 1538 bytes. This buffers should be cocatenable
to chains of buffers/chains of bufferchains. also they should be resizeable on both ends (head/tail).
You pass a buffer chain/a buffer up/down the stack. It shrinks/grows in the process. The buffer is once created and
then a pointer to it is passed forth. TCP segmentation & Nback protocol nature requires that you keep the original
buffers in memory so tcp send creates copys of the portion to be sent. Only after the according ack has arrived you
can drop that partial buffer. The receiving part keeps the received segments in ascending sorted order (sorted by sequence number)
you send an ack for segments arrived, with which you can form contiguous data (so the stream is without holes). So the
tcp ack tells the sender: up to this segment I have received data OK. A segment which creates a hole in the data stream
won't be ack'd until the missing segment has arrived. Then, the last segment is ack'd, telling the sender: data till this
segment received. What about brief excursions to other topics? *rofl*
The thing with the NIC's:
Most of them sport a ring buffer mechanism. F. Ex. Lance as found in VMWARE, RTL8139, RTL8169. Others are
to be treated in PIO Mode (NE2000 compliant ones like the RTL8029?).
Let's stay with the ring buffer stuff: you need two buffers: one for receiving, one for transmitting. Then
you need memory for the ring buffer descriptors. where each descriptor points to a chunk in the transmit/receive buffer.
These descriptors are initialized so that they point to their portion of buffer memory, sport a flag like
"NIC_OWN" and a size field. The buffer chunks are usually max mtu (1538 bytes)
Mark the following: These memory areas need to reside in contiguous memory (you will need dma memory allocating
stuff in your kernel. Also you will need a mechanism to get the physical address of an allocated region). You
tell the NIC physical addresses for it knows naught about MMU & sorta.

NICs dealing with ringbuffers operate via DMA as pci bus masters. They circle throu their ring buffer descriptors as
long as none has a flag "NIC_OWN" set, and transmit. They take the next receive descriptor with the flag "NIC_OWN" and
write received data to the buffer memory the descriptor in question points to.
After each receive/transmit an interrupt is triggered to notify the driver of the happenings. We DO want
to now what has happened to our data, ok?
You read out a register of the NIC which tells: why the irq?
So, for a transmit IRQ, you check for error flags in the ring descriptors which aren't "NIC_OWN". Regardless of
the result, after checking you hand the descriptor back to the nic by setting the flag "NIC_OWN".
For a receive irq, it is similar, just that you, after SUCCESSFUL checking of the flags, you pry the data
out of the buffer chunk (you know it's virtual address by doing some maths) in the received size. That's done by
creating a buffer and copying the data over. Then you call if_input of the driver to do the protocol stuff.
Now the question arises: how do I send data? Simply: you take the next free transmit descriptor. you copy the data
in the length of your buffer to the buffer chunk the descriptor points to (do the maths to get virtual address.)
Then you set a flag "NIC_SEND" in the transmit descriptor as well as the message size field. There you go.
YOu see: receiving data from a nic is completely async. you don't know when a packet arrives and in which size.
YOu get packets. Lots of them. Deal with it.
More will come later.
Be assured: Graphics are to come for illustrating at least the message passing stuff. It is fun to discover such things - and to find out, how it all runs in a computer.
back
