ChaOS Source Notes


I will post the ChaOS source code in the source index, bit by bit. Source code is the stuff that software is made of - language which is readable yet sufficiently structured to be unambiguously converted into instructions for a computer processor.

SOME DIARY NOTES ON THE CHAOS SOURCE INDEX:

30/10/2011 ChaOS v1.02 stable, being the container for the 32-bit self-compiling system. dec2011.iso imminent, with hard disk installer. ChaOS v1.03 now the container for 64-bit features, and development of the 64-bit self-compiling system.

11/8/2011 Major changes to ChaOS internals over the last year, dec2009.iso continually revised to keep abreast of developments, though function remains the same. VESA graphics modes now supported from the bootstrap loader, multiple thread launcher added, and 64-bit debugging to support development of ChaOS64.

29/5/2010 ChaOS FTP client much improved. Many source files are now uploaded direct from CFS ChaOS development drive to ctpp.co.uk on Zen server, using small FTP scripts.

22/5/2010 Development of ChaOS has now moved on to the native ChaOS filesystem CFS, which is now stable. Native CFS uses .html as the default extension for source files, so many filenames on this website are being changed to lower case .html, instead of upper case .HTM. Project make files previously with the file extension .LNK now changed to .link.

26/4/2010 CFS revised and substantially improved, prototype now in testing with native bootstrap. The CFS bootstrap requires EDD; older ChaOS systems without EDD will have to stick to a FAT16 bootstrap (these are generally 1990s machines). Typical performance is better than twice the speed of FAT16. File extension sets are now controlled by a logical drive attribute, which allows files on different filesystems with different file extensions to be considered equivalent when synching between drives. This provides instant bridging back to FAT16, in order to support the older machines.

30/3/2010 Longfilenames over FAT16 now working well, however during the last year the Microsoft FAT/longfilename patent pendulum swung back towards Microsoft. With this news, I am resurrecting the old ChaOS native filesystem (CFS) to ditch the last baggage of Microsoft from my system.

18/01/2010 ChaOS source code is still changing too fast to release a meaningful archive. Currently bomb-testing longfilename support. Files released so far have 8:3 filenames. Longfilenames will be used to eliminate file-extension conflicts with Windows XP and Linux before the ChaOS source tree is made public. Already, source files are all .HTM, ready for internet browsing. Subject to testing, these will become .html files.

29/11/2009 This year has been taken by the development of TCP/IP for ChaOS. To complement load-on-detect device drivers (of which COM.DRV below is one) ChaOS now has detachable protocol modules to support ARP, IP, TCP, UDP, DHCP, HTP, TELNET and FTP. Source code for these modules will appear on the ChaOS Source Index in due course. FTP access to native ChaOS server was achieved today. As a consequence, some ChaOS file extensions may change, to avoid conflict with .lnk (shortcut) and .job (Windows Task Object).

16/2/2009 Posted source code to COM.DRV load-on-detect device driver which includes GPS support

25/12/2008 As yet only a fraction of the ChaOS source code is in the source index because it changes daily as I prepare the downloads. Source code is viewable in the debugger after booting a ChaOS ISO CD (press Pause/Break key anytime to enter the debugger). These are the most up-to-date files.

CHAOS SOURCE NOTES START HERE:

Why write your own operating system? A good question. Most people will follow the crowd; it is the herd instinct. But any code you write for a given system may be trash when the operating system itself is superseded. Modern computer languages such as XML and PHP improve the reusability of code, but these are not much use when asking a computer to control a machine, or balance the company books. They also depend on the operating system beneath to provide consistent interpretation, which is not always the case. Then it is back to head scratching and trial and error to get things done. I just decided one day it was better to spend my hours of trial and error on my own self-booting, self-modifying system.

ChaOS is written in the widely-used computer language called C, with some custom features. I wanted to create an operating system capable of rebuilding itself, with the minimum of fuss, so I wrote the ChaOS compiler (the program to convert the source code into a live program). The compiler has some non-standard features - I use shortcuts for the C inbuilt types, some special compiler directives, and an inline assembler which uses standard Intel mnemonics for instructions and processor registers, but uses C notation for data declarations. Since I started uploading these pages in September 2008 I decided to add further modifications to the ChaOS compiler to make the source code more internet-friendly, i.e. to allow simple .HTML files to be used for source code, and a new #include directive syntax which doesn't clash with HTML.

Modern computers are complex, so operating systems are equally complicated. If you are unfamiliar with the C programming language and/or Intel Architecture assembly language, you may find the ChaOS source code difficult to follow. Try searching through the reference documents for the words in the source code which you do not understand.

To run a custom operating system, we must first understand what happens when a computer is switched on.

BIOS (Basic Input-Output System)

BIOS is a program which is stored on a ROM (Read-Only-Memory) chip inside every PC, and takes control as soon as the computer is powered up. BIOS is customised by the hardware manufacturer to suit their particular mainboard, and provides configuration services too (usually accessed by pressing the DELETE or F1 key as the PC starts up). BIOS also contains many useful routines which can be accessed via software interrupts. After hardware testing and initialisation BIOS goes into a routine called BIOS bootstrap. This is the mechanism which allows a user program such as Microsoft Windows, stored on disk or other medium, to take control of the computer.

Bootstrap

The PC bootstrap mechanism runs in 16-bit mode and was designed years ago to allow a computer to start from a floppy disk. It is disarmingly simple - read the first sector of the disk into a known memory location, and set the processor's instruction pointer to that location. The operating system designer ensures that the first sector (boot sector) of the disk (referred to as the boot disk) contains a meaningful sequence of processor instructions. With the advent of hard disks, a more powerful boot sector was adopted (the partition table) - still a small chunk of machine-code, but containing up to 4 distinct disk addresses where a boot sector could be located. This allowed up to 4 different operating systems to be present on one hard disk. Modern computers, through BIOS configuration, can read a boot sector from several sources, typically hard disk and CDROM, but also ZIP drive, and memory cards. (It's fun to boot ChaOS from a USB memory stick).
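The partition table described above can be sketched in standard C (not the ChaOS dialect). This is a minimal illustration of the classic PC layout: four 16-byte entries starting at byte 446 of the first sector, followed by the signature bytes 0x55, 0xAA. The names PartEntry and parse_partition_table are hypothetical and are not taken from the ChaOS source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    SECTOR_SIZE      = 512,
    PART_TABLE_OFF   = 446,  /* 0x1BE: start of the four 16-byte entries */
    PART_ENTRY_SIZE  = 16,
    PART_ENTRY_COUNT = 4,
    BOOT_SIG_OFF     = 510   /* signature bytes 0x55, 0xAA */
};

/* Hypothetical in-memory form of one partition table entry. */
typedef struct {
    uint8_t  bootable;      /* 1 if the status byte is 0x80 (active) */
    uint8_t  type;          /* partition type, 0 means unused */
    uint32_t first_lba;     /* disk address of the partition's boot sector */
    uint32_t sector_count;
} PartEntry;

/* Read a 32-bit little-endian value byte by byte. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns the number of used entries, or -1 if the signature is missing. */
int parse_partition_table(const uint8_t *sector, PartEntry out[PART_ENTRY_COUNT])
{
    assert(sector != NULL && out != NULL);
    if (sector[BOOT_SIG_OFF] != 0x55 || sector[BOOT_SIG_OFF + 1] != 0xAA)
        return -1;

    int used = 0;
    for (int i = 0; i < PART_ENTRY_COUNT; i++) {
        const uint8_t *e = sector + PART_TABLE_OFF + i * PART_ENTRY_SIZE;
        out[i].bootable     = (uint8_t)(e[0] == 0x80);
        out[i].type         = e[4];
        out[i].first_lba    = read_le32(e + 8);
        out[i].sector_count = read_le32(e + 12);
        if (out[i].type != 0)
            used++;
    }
    return used;
}
```

A boot manager would pick the entry marked bootable and read the sector at first_lba, which is the boot sector of that partition.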

One disk sector is only 512 bytes, and no useful program is so small, so the boot sector usually contains the name of a larger file (the bootstrap), and sufficient code to locate and load it. In turn this program will load other files, in order to start the operating system proper. In this way the system pulls itself up by the bootstraps (like a cowboy putting his boots on?), hence the name. In MSDOS the bootstrap is IO.SYS, in Windows XP it is NTLDR. Have a look at the ChaOS boot sector code here*. See if you can find the name of the ChaOS bootstrap - LOADER.BIN. The whole of the ChaOS bootstrap is online in the ChaOS Source Index.

*  Note 11/8/2011 This bootstrap has been superseded. Latest ChaOS versions have a new bootstrap, to work with CFS, the ChaOS File System, rather than FAT16 disk layout.

LOADER.BIN loads the operating system image into memory above the first megabyte boundary (i.e. above linear address 0x00100000), then performs a switch to 32-bit protected mode, which allows a maximum address space of 4 Gigabytes.

File System

Every operating system needs a file system, to locate system components and load them into memory. I've grown up with MSDOS, and early ChaOS was developed over MSDOS running a WATCOM DOS4G extender. So the first native ChaOS file system was FAT16 on hard disk, FAT12 on floppy disk. ChaOS will read-write FAT32 partitions too. The native ChaOS File System (CFS) superseded FAT16 in March 2010 and will feature on the next ISO download.

ChaOS Model

The ChaOS operating system model is 4Gb linear, small memory model, with no segmentation. This means segment registers are never reloaded during a session. Segment register loads use dozens of processor clock cycles as they enforce the segment protection mechanisms of Intel architecture processors. By avoiding these mechanisms, ChaOS speeds up control transfers such as operating system calls - a simple 32-bit relative near call suffices. ChaOS is single-threaded, so there are no task switches either. Task switches burn hundreds of processor clock cycles because they involve several segment register loads.

Hardware is serviced through interrupt handlers which post messages to the system queue. Background processes are serviced through system-wide polling when the message queue is empty. So although ChaOS runs as a single processor task, it supports multiple concurrent processes.
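The queue-then-poll arrangement can be sketched in standard C as follows. This is only an illustration of the idea, not the ChaOS implementation: the names Message, post_message, system_step and the queue capacity are all assumptions made for the example.

```c
#include <stddef.h>

#define QUEUE_SIZE 16   /* hypothetical capacity; must be a power of two */

typedef struct { int id; int param; } Message;

static Message queue[QUEUE_SIZE];
static volatile unsigned head, tail;   /* head: next to read, tail: next to write */

/* Called from an interrupt handler: post the message and return quickly. */
int post_message(int id, int param)
{
    if (tail - head == QUEUE_SIZE)
        return 0;                       /* queue full, message dropped */
    queue[tail % QUEUE_SIZE] = (Message){ id, param };
    tail++;
    return 1;
}

/* Remove the oldest message; returns 0 if the queue is empty. */
static int get_message(Message *m)
{
    if (head == tail)
        return 0;
    *m = queue[head % QUEUE_SIZE];
    head++;
    return 1;
}

typedef void (*PollFn)(void);

int handled_count;                      /* counters so the sketch can be observed */
int poll_count;

static void handle(const Message *m) { (void)m; handled_count++; }
void background_poll(void) { poll_count++; }

/* One pass of the main loop: messages first, background polling only when idle. */
void system_step(PollFn *pollers, size_t n)
{
    Message m;
    if (get_message(&m)) {
        handle(&m);
        return;
    }
    for (size_t i = 0; i < n; i++)
        pollers[i]();
}
```

Because every background process gets a turn whenever the queue is empty, several activities can make progress without any task switch.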

ChaOS Executables

As an avid programmer, it has always irked me that the process of program creation (compiling and linking) using conventional operating systems always discards information necessary to reverse a compiled program into its original source code, and to reverse the program load process. In ChaOS program creation, nothing is discarded, and ChaOS executable files (.XEC files) are designed to contain all the information required to remake the program. The files start with a one-sector header (512 bytes) which contains links in the form of simple file offsets to all the important structures. In order to support source-line debugging, the whole of the source code used to create the program is embedded within the .XEC file, including header files. Also embedded are a symbol table and global type information. These are necessary to support the ChaOS dynamic-link mechanism. A table of relocations records all references to symbols in the code and data streams. Finally a fastrelocs table shorthands the relocation table into a series of file offsets to which the program load address is added to make the program run at its given location in 4Gb linear address space. This means that relocation patching is lightning fast. There is a downside - typically the inclusion of all this information makes the program 6 times larger than a raw executable. With compression of the source and symbol tables the executable file size can be halved. Suppression of the source file information can reduce the size further. But the advantages outweigh the size problem - full source-line debugging, reversible relocation tables and fully type-safe dynamic linking.

Why reversible relocations? There are two reasons. The first is that a running program can, in a heartbeat, be reversed to its disk image, moved to a new location, and repatched to run at the new location. The second reason is much more useful: it allows a program to hook calls in a program already running and redirect them to a later-loaded module. This allows the possibility that the code for a running function could be extracted into an editor, modified, recompiled and hooked into a system to test the effect of a bug fix - all without taking the system down for a full recompile (keyhole surgery for software?). It also allows for two-way dynamic linking - which I am using in the ChaOS IP protocol suite. Conventional dynamic linking allows a later-loaded program B to call functions in an earlier-loaded program A, but not vice-versa. With two-way dynamic-linking, as B loads, it can link calls to functions in A, and hook calls by A and direct them back into itself, a proper two-way street which is ideal for inserting protocols into a running network. And the reversible nature of the ChaOS design allows simple unloading of a test protocol. This is ideal for temporary network protocols to be loaded into a system in response to a virus attack, for example.
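The fastrelocs idea and its reversal can be sketched in standard C. The sketch assumes that each table entry is the offset of a 32-bit address field within the loaded image; the real .XEC table layout is not shown on this page, so the function names and signatures here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Add delta to the 32-bit value stored at image + offset. */
static void adjust32(uint8_t *image, uint32_t offset, uint32_t delta)
{
    uint32_t v;
    memcpy(&v, image + offset, sizeof v);   /* memcpy avoids unaligned access */
    v += delta;                              /* unsigned arithmetic wraps safely */
    memcpy(image + offset, &v, sizeof v);
}

/* Patch every recorded address field so the image runs at load_address. */
void relocate(uint8_t *image, size_t image_size,
              const uint32_t *fastrelocs, size_t count, uint32_t load_address)
{
    for (size_t i = 0; i < count; i++) {
        assert((size_t)fastrelocs[i] + sizeof(uint32_t) <= image_size);
        adjust32(image, fastrelocs[i], load_address);
    }
}

/* Reverse the patching: the image returns to its on-disk form. */
void unrelocate(uint8_t *image, size_t image_size,
                const uint32_t *fastrelocs, size_t count, uint32_t load_address)
{
    relocate(image, image_size, fastrelocs, count, (uint32_t)(0u - load_address));
}
```

Moving a loaded program then amounts to unrelocate with the old address, a memory copy, and relocate with the new address.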

The ChaOS dynamic-link mechanism has other advantages during development too. Symbol tables of loaded modules are searched in reverse order of loading, so if a later-loaded module contains a function with the same name and prototype as an operating system function, any subsequent loading program will link into the later version of the function. This is great for testing modifications to operating system functions before updating the system itself. This too could be used in an emergency to redirect a virus attack.
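The reverse-order search can be sketched in standard C. The Symbol and Module records below are hypothetical stand-ins (a textual prototype replaces the ChaOS global type information, and an integer id replaces the function address); only the search order is the point of the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical symbol record: a name plus a textual prototype. */
typedef struct {
    const char *name;
    const char *prototype;
    int         id;          /* stands in for the function's address */
} Symbol;

typedef struct {
    const Symbol *symbols;
    size_t        count;
} Module;

/* modules[0] is the first-loaded module; search from the newest downwards. */
const Symbol *resolve(const Module *modules, size_t loaded,
                      const char *name, const char *prototype)
{
    assert(modules != NULL || loaded == 0);
    for (size_t m = loaded; m-- > 0; ) {
        for (size_t s = 0; s < modules[m].count; s++) {
            const Symbol *sym = &modules[m].symbols[s];
            if (strcmp(sym->name, name) == 0 &&
                strcmp(sym->prototype, prototype) == 0)
                return sym;
        }
    }
    return NULL;   /* unresolved: no loaded module exports a matching symbol */
}
```

With only the operating system loaded, a request for memcpy resolves to the system version. Once a test module exporting the same name and prototype is loaded, later requests resolve to the test version instead.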

Are these mechanisms insecure? Perhaps. But computer viruses only exist because they can find large numbers of systems with the same DNA. Operating systems with fixed DNA are the sitting duck.

ChaOS compiler

The ChaOS compiler is a C compiler, with custom features. C++ style comments are preferred, though C style comments are accepted. The inline assembler accepts C++ style comments too. ChaOS runs a 4Gb flat memory model: data addresses are compiled as 32-bit integers, and all functions (including operating system services) are accessed by a simple 32-bit near call. Unlike Linux and Windows, ChaOS uses no paged memory, so programmers can be sure transient information such as passwords and network secrets will never be written to a hard disk swapfile. Using 4Gb flat memory with no task-switches means the processor performs no segment loads, not even for interrupt handlers, so everything runs pretty fast. Structures cannot be passed to or returned from functions, but pointers to structures do the same job, and slot neatly into the model as they too are 32-bit addresses. This means integers and pointers returned from functions are always found in the EAX register. Functions returning floating point values leave the return value in register ST(0) on the maths chip, and the calling function retrieves the value from there. Again, the single-thread ChaOS design allows such a simplistic approach. (On pre-emptive multi-tasking systems such as Windows the maths chip state needs to be saved and restored at every task switch, absorbing more valuable processor cycles). With the advent of multi-core processors, it is logical to delegate multiple tasks to multiple cores, rather than ask one core to flit between multiple tasks.

To write a useful program, the programmer needs access to functions which will access the target computer's resources, and these functions are usually provided by libraries which are combined with the programmer's object files by a linker. Dozens of simple functions, such as memcpy() (memory copy) are linked into almost every executable file in an operating system such as Windows. This is an enormous exercise in duplication, but with good reason - since Windows is a pre-emptive multitasking system (you never quite know when the system will interrupt your program to take control of the processor to do something else), each user program needs its own copy of library functions as many of them are non-reentrant. In ChaOS, the library functions are part of the operating system; they are written to use the stack wherever possible to ensure re-entrancy; and calls to simple functions such as memcpy are resolved at load-time using the ChaOS dynamic-link mechanism. As a result the code size of ChaOS programs is typically half that of the same project compiled for DOS4GW. This simplistic approach means that multiple processor cores can share library functions too (because each core uses a different stack).
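The difference between a non-reentrant library function and one that keeps its working state on the caller's stack can be sketched in standard C. The function names below are invented for the illustration and do not come from the ChaOS library.

```c
#include <assert.h>
#include <stddef.h>

/* Write the low 32 bits of n as 8 hex digits plus a terminating NUL. */
static void to_hex(unsigned long n, char *out)
{
    static const char digits[] = "0123456789ABCDEF";   /* read-only, so safe to share */
    for (int i = 7; i >= 0; i--) {
        out[i] = digits[n & 0xF];
        n >>= 4;
    }
    out[8] = '\0';
}

/* Non-reentrant: every caller shares one static buffer, so a second
   call (or a call from another core) overwrites the first result. */
const char *hex_shared(unsigned long n)
{
    static char buffer[9];
    to_hex(n, buffer);
    return buffer;
}

/* Re-entrant: the result lives in a buffer supplied by the caller,
   typically a local array on the caller's own stack. */
char *hex_reentrant(unsigned long n, char *out)
{
    assert(out != NULL);
    to_hex(n, out);
    return out;
}
```

Only the second form can safely be shared by several callers at once, which is why a single system-wide copy of such a function is enough.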

Some non-standard features allow a simple HTML file to be used as source code for the ChaOS compiler. A simple HTML wrapper forces web browsers to scan the source code for HTML constructs. This allows hyperlinks to be embedded within source code comments.

The ChaOS compiler CC.XEC processes source files into .CBJ files. The extension .CBJ is a marker for the ChaOS linker CL.XEC to locate the modules needed to build a project. These are specified using a makefile (.LNK). The main operating system file OS.XEC is made by linking together 78 .CBJ files, and these files are specified in the makefile OS.LNK. Although the file formats for OS.XEC and a loadable program like CC.XEC (or ED.XEC the ChaOS source file editor) are identical, OS.XEC is special because it contains no external references when linking is complete. I call this a standalone .XEC program, an essential characteristic of the first-loaded operating system file.

OS.XEC also contains code to load other .XEC files and run them. Usually these programs contain references which the linker left unresolved. These are unresolved external symbols. The loader in OS.XEC searches the symbol tables of other loaded .XECs (including OS.XEC itself) to hook programs into the system. Not only are the symbol names matched, the types are matched too. Everything has to match exactly or the program will not load. I call this type-safe dynamic linking, and have found it to be a powerful tool in future-proofing my software.

Conventionally, operating system structures and APIs only change when vendors release major new versions, so type-safe linking isn't needed; older software just becomes broken, unusable and fades away. ChaOS is designed to be modified incrementally. Phase errors (mismatches between one program and another accessing the same data) are detected at load time before the offending (maybe older) program gets a chance to run. Phase errors are probably the nastiest kind of software bug, because they may pass undetected into a finished program without causing a crash. They can cause debuggers to display data incorrectly and can lie in wait for months or years before bringing a system down. So in ChaOS, if I change an operating system structure, perhaps changing the name of a data field or adding an extra one, any programs which use the old structure will be terminated by the loader, with an informative error message. This is not as bad as it sounds. Proper programming discipline means such shared structures will be contained in common #include files, so I simply switch to the source directory for the offending project, type make, and the compiler and linker will update the project's .XEC file, resynchronising data structures with the updated operating system.
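A load-time check of this kind can be sketched in standard C. The sketch assumes each symbol carries a textual description of its type, which is a simplification; the ChaOS global type system is not documented on this page, and TypedSymbol, LinkResult and the function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

typedef struct { const char *name; const char *type; } TypedSymbol;

typedef enum { LINK_OK, LINK_MISSING, LINK_TYPE_MISMATCH } LinkResult;

/* Look for an exported symbol whose name AND type both match. */
LinkResult check_external(const TypedSymbol *exported, size_t n,
                          const TypedSymbol *wanted)
{
    LinkResult result = LINK_MISSING;
    for (size_t i = 0; i < n; i++) {
        if (strcmp(exported[i].name, wanted->name) != 0)
            continue;
        if (strcmp(exported[i].type, wanted->type) == 0)
            return LINK_OK;
        result = LINK_TYPE_MISMATCH;   /* name found, but with another type */
    }
    return result;
}

/* Returns -1 if every external resolves, otherwise the index of the first
   external that fails, with the reason stored in *why. */
int first_link_failure(const TypedSymbol *exported, size_t n,
                       const TypedSymbol *externals, size_t m, LinkResult *why)
{
    for (size_t i = 0; i < m; i++) {
        LinkResult r = check_external(exported, n, &externals[i]);
        if (r != LINK_OK) {
            *why = r;
            return (int)i;
        }
    }
    *why = LINK_OK;
    return -1;
}
```

A program built against an old structure description fails with LINK_TYPE_MISMATCH before any of its code runs, and the index identifies which symbol to name in the error message.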

A consequence of this design is that the .CBJ files produced by the ChaOS compiler are almost identical to .XEC files. They are in fact one-module .XEC files, capable of being loaded and executed as long as the operating system contains all the unresolved externals!

Inbuilt types

For many years I have used #define directives to create shorthand types, e.g.

#define      UL      unsigned long

Then, instead of declaring

unsigned long n;

I write

UL   n;

This saves a lot of typing. Whilst the ChaOS compiler accepts standard data declarations, it looks for the shorthands first, to speed up compilation. ChaOS is for Intel processors, which are known as little-endian due to the way memory is organised. But the world is not just Intel, so I have inbuilt big-endian integers too (as found in Motorola/Apple Macs, and in abundance on the internet). The ChaOS compiler handles the byte-swapping necessary to mix Motorola and Intel integers freely in arithmetic expressions. These inbuilt types, and C structures are understood by the inline assembler - I find this much easier than declaring data the old MASM way. Also the assembler uses C notation for constants, i.e. 256 is 0x100, or 100000000b (binary) but not 100h. I don't use octal notation. Floating point numbers in ChaOS are all 64-bit IEEE format, and run on the 486/Pentium maths co-processor. There is no floating-point emulation, so if you want to run ChaOS on a 386 (you idiot), it's integer-only or you need a 387 maths chip.
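In a standard C compiler the byte-swapping has to be written by hand. The following sketch shows roughly what mixing a big-endian and a native integer in one expression involves; BE16 and BE32 are hypothetical stand-ins for the Um and UM declarators, and the helper names are invented for the example.

```c
#include <stdint.h>

/* Big-endian integers kept as raw bytes, most significant byte first. */
typedef struct { uint8_t b[2]; } BE16;   /* stands in for Um */
typedef struct { uint8_t b[4]; } BE32;   /* stands in for UM */

uint16_t be16_load(BE16 v)
{
    return (uint16_t)((v.b[0] << 8) | v.b[1]);
}

BE16 be16_store(uint16_t n)
{
    BE16 v = { { (uint8_t)(n >> 8), (uint8_t)n } };
    return v;
}

uint32_t be32_load(BE32 v)
{
    return ((uint32_t)v.b[0] << 24) | ((uint32_t)v.b[1] << 16) |
           ((uint32_t)v.b[2] << 8)  |  (uint32_t)v.b[3];
}

BE32 be32_store(uint32_t n)
{
    BE32 v = { { (uint8_t)(n >> 24), (uint8_t)(n >> 16),
                 (uint8_t)(n >> 8),  (uint8_t)n } };
    return v;
}

/* Hand-written form of "big-endian value + native value": swap in,
   do the arithmetic in native form, swap back out. */
BE32 add_mixed(BE32 a, uint32_t b)
{
    return be32_store(be32_load(a) + b);
}
```

Network protocol headers are a typical use: a TCP port number would be held as a BE16 and a sequence number as a BE32.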

ChaOS Two-character Data Declarators

ChaOS declarator (shortcut)   Equivalent 'C' declarator        Machine storage

VD                            void                             none
CH                            char                             8-bit signed
SC                            char, signed char                8-bit signed
UC                            unsigned char                    8-bit unsigned, little-endian
SI                            int, signed int                  16-bit signed, little-endian
UI                            unsigned int                     16-bit unsigned, little-endian
SL                            long, signed long                32-bit signed, little-endian
UL                            unsigned long                    32-bit unsigned, little-endian
Sm                            No equivalent 'C' declaration    16-bit signed, big-endian
Um                            No equivalent 'C' declaration    16-bit unsigned, big-endian
SM                            No equivalent 'C' declaration    32-bit signed, big-endian
UM                            No equivalent 'C' declaration    32-bit unsigned, big-endian
FL                            float                            64-bit IEEE double-floating point
DB                            double                           64-bit IEEE double-floating point
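For readers who want to try ChaOS-style declarations with a standard C11 compiler, the little-endian rows of the table could be approximated with fixed-width typedefs. This is a hypothetical compatibility header, not part of ChaOS; it assumes the storage sizes listed above, and it leaves out the big-endian declarators because standard C has no equivalent.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the ChaOS two-character declarators. */
typedef void        VD;
typedef char        CH;   /* signedness of plain char depends on the compiler */
typedef signed char SC;   /* 8-bit signed */
typedef uint8_t     UC;   /* 8-bit unsigned */
typedef int16_t     SI;   /* 16-bit signed, unlike int on most modern compilers */
typedef uint16_t    UI;   /* 16-bit unsigned */
typedef int32_t     SL;   /* 32-bit signed */
typedef uint32_t    UL;   /* 32-bit unsigned */
typedef double      FL;   /* the table lists FL as 64-bit, so double, not float */
typedef double      DB;   /* 64-bit IEEE */

/* Compile-time checks that the storage sizes match the table. */
_Static_assert(sizeof(SI) == 2 && sizeof(UI) == 2, "16-bit types expected");
_Static_assert(sizeof(SL) == 4 && sizeof(UL) == 4, "32-bit types expected");
_Static_assert(sizeof(FL) == 8 && sizeof(DB) == 8, "64-bit floating point expected");
```

The fixed-width types matter because int and long have different sizes on different modern compilers, whereas the ChaOS declarators always have the sizes shown in the table.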

Special compiler directives

#pragma breakpoints on                - causes compiler to generate source-line debug breakpoints
#pragma breakpoints off               - suppresses source-line debug breakpoints
#pragma breakpoint                    - place hard breakpoint in code stream 
#pragma rmode32                       - causes assembler to generate code for 32-bit code segments (default)
#pragma rmode16                       - causes assembler to generate code for 16-bit code segments
#pragma inlinedata                    - causes data spaces to be generated in the code stream
#asm                                  - causes compiler to switch to assembly language (when outside 'C' code blocks)
#endasm                               - causes compiler to switch to 'C'
#includebin                           - includes raw file into data stream
#includebin UC* bs="bootsect.bin";   - raw file data assigned to a global data symbol (pretty useful)

rmode16 and inlinedata are used extensively in the ChaOS partition table, boot sector and bootstrap code. #asm and #endasm must be used in matched pairs.

Function Overloading

One of the powerful features of C++ is that the same symbol name can be used to invoke different functions, according to the type of the function arguments. In all the commercial C++ compilers I have used, this is achieved by 'name-mangling' in the symbol table, to generate unique names for the linker to resolve. In ChaOS I have a global type system, so functions are made unique by the combination of a name plus the global function type. No name mangling is necessary because the linker understands the types too. The type system is part of the operating system, and is called by the compiler and linker when compiling programs. The program loader uses the type system to resolve dynamic links when loading programs. This means operating-system calls in ChaOS are 'C' linkage, and type-safe. As added security the type system requires that argument names in function prototypes match exactly those in the function declaration. This is a departure from commercial C and C++ compilers which allow any argument name in a prototype (including none!). This is not an issue if common header files are used by the operating system and the user program, and enforces discipline when posting function prototypes to header files.

Inline Assembler

The assembler can be started at any point in the source code using the compiler directive #asm. Compiler directives begin with a hash (#) which must be the first character on a line. The complementary directive #endasm switches the assembler off. Data declarations inside the assembler are made using the two-character data declarators above. The C compiler and the assembler share the same namespace and symbol table to allow easy transfer of control or data between the two.

#asm
function::                     //a global code label
      mov      eax,3
      int      0x10
      ret
#endasm

This function could be called from within a C code block like this:

               function();

Calling a C function from assembly language is easy: linkage will be to the first function with a matching name. In the absence of name-mangling, this could be one of several possible targets. When calling out from assembly language to C, a unique function name always produces the desired result. This method is used to transfer control from the ChaOS kernel startup module to osmain(), see startup.html.

#asm
         call   osmain
#endasm 

Although #asm and #endasm can be used inside 'C' functions, the keyword asm allows assembly language to be written in blocks enclosed by braces. Here are some examples:

         asm{//assembly language here}  
         asm{cli} 
         asm
               {
                  mov   eax,3
                  int   0x10
               }
         if(systemerror){asm{call systemexit,}}

This syntax is more intuitive: the closing brace switches off the assembler, and asm can appear anywhere in the C code stream. Remember that once the assembler is switched on it will expect only one processor instruction per code line, with parameters separated by commas, i.e.

         if(systemerror){asm{mov eax,-1 call systemexit,}}      //syntax error
         if(systemerror){asm{mov eax,-1 
                             call systemexit,}}                 //OK

The asm{.......} syntax within a function allows local (stack) variables to be accessed in assembler instructions, much easier than using assembler macros to create and destroy stack frames.

UL assemblyfunction(UL num1,UL num2)
{
      asm   {
               mov   eax,num1
               mov   ecx,num2
               add   eax,ecx
            }
      //return value (left in EAX) is the integer sum of num1 and num2
}

UL   q=assemblyfunction(17,35);      //value stored in q will be 52
 

Mixing the best features of 'C' and Assembly language makes some functions very simple indeed. Here is the ChaOS sin function, from math.htm:

DB   sin(DB n)      //ANSI C 7.5.2.6
{// return sine of angle n (in radians)
asm   {
           fld   n
           fsin     
      }
//return value is in ST(0), the stacktop register of the maths co-processor
}      

Comments (inline assembler) 

Comments can be C++ style, i.e.

            // anywhere on a line means the rest of the line is skipped by the assembler
            //this is a comment
//so is this

and comments can be assembly-style, i.e.

                        ;  anywhere on a line means the rest of the line is skipped by the assembler
      ; so this is also a comment

Reference Documents

C language reference

Intel 64 and IA-32 Architectures Software Developer's Manuals:

Volume 1: Basic Architecture

Volume 2A: Instruction Set Reference, A-M

Volume 2B: Instruction Set Reference, N-Z

Volume 3A: System Programming Guide

Volume 3B: System Programming Guide

Glossary

Program: A program is a series of instruction bytes (referred to as machine-code) which are meaningful to a processor as a sequence of tasks to be performed, such as memory retrieval, arithmetic operations, memory storage and hardware input/output. A program must be located in computer memory for the processor to access the instructions. When the instruction pointer of a processor is set to the starting memory address of the loaded program, the program will run.