System Call
Concepts
System calls provide the interface between user programs and the kernel.
Abstracted hardware interface
Security and stability
Allows virtualization
Mode, Space, Context
Mode: hardware restricted execution state
restricted access, privileged instructions
user mode vs. kernel mode
“dual-mode architecture”, “protected mode”
Intel supports 4 protection “rings”: 0 kernel, 1 unused, 2 unused, 3 user
Space: kernel (system) vs. user (process) address space
requires MMU support (virtual memory)
“userland”: any process address space; there are many user address spaces
reality: kernel is often mapped into user process space
Context: kernel activity on “behalf” of ???
process: on behalf of current process
system: unrelated to current process (maybe no process!)
example “interrupt context”
blocking not allowed!
Interrupts and exceptions
Interrupts - async device to cpu communication
example: service request, completion notification
aside: IPI – interprocessor interrupt (another cpu!)
system may be interrupted in either kernel or user mode
interrupts are logically unrelated to current processing
Exceptions - sync hardware error notification
example: divide-by-zero (AU), illegal address (MMU)
exceptions are caused by current processing
Software interrupts (traps)
synchronous “simulated” interrupt
allows controlled “entry” into the kernel from userland
Cost of Crossing the “Kernel Barrier”
more than a procedure call
less than a context switch
costs:
vectoring mechanism
establishing kernel stack
validating parameters
kernel mapped to user address space?
updating page map permissions
kernel in a separate address space?
reloading page maps
invalidating cache, TLB
Hello World – User Program’s View
> cat >hello.c
#include <stdio.h>
int main(int argc, char *argv[]) {
printf("Hello world!\n");
return 0;
}
> gcc -o hello hello.c
> ltrace ./hello
__libc_start_main(0x8048394, 1, 0xbffff914, 0x80483b8, 0x8048400
printf("Hello world!\n"Hello world!
) = 13
+++ exited (status 0) +++
> strace ./hello
execve("./hello", ["./hello"], [/* 40 vars */]) = 0
uname({sys="Linux", node="tara", ...}) = 0
brk(0) = 0x804a000
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
old_mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7fe9000
open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=50648, ...}) = 0
old_mmap(NULL, 50648, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7fdc000
close(3) = 0
access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
open("/lib/tls/i686/cmov/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\215Y\1"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0644, st_size=1222116, ...}) = 0
old_mmap(NULL, 1232428, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0xb7eaf000
old_mmap(0xb7fd1000, 36864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x121000) = 0xb7fd1000
old_mmap(0xb7fda000, 7724, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xb7fda000
close(3) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7eae000
set_thread_area({entry_number:-1 -> 6, base_addr:0xb7eae080, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0
munmap(0xb7fdc000, 50648) = 0
fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 3), ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7fe8000
write(1, "Hello world!\n", 13Hello world!) = 13
munmap(0xb7fe8000, 4096) = 0
exit_group(0) = ?
Application
Calls printf()
C library (glibc)
printf() function issues write() system call.
Kernel
write() system call manages output.
returns a negative error code if an error occurs; the C library wrapper then sets the global errno variable.
returns to user application
System Calls vs. Library Calls
man 2 (system calls); man 3 (library calls)
historical evolution of # of calls
Unix 6e (~50), Solaris 7 (~250)
Linux 2.0 (~160), Linux 2.2 (~190), Linux 2.4 (~220), Linux 2.6.9 (~280)
library calls vs. system call possibilities:
library call never invokes system call
library call sometimes invokes system call
library call always invokes system call
system call not available via library
can invoke system call “directly” via assembly code
Linux System Calls
Broad system call categories:
files, i/o, devices
memory, processes
ipc, time, misc
System call listing:
/usr/src/linux/include/asm-i386/unistd.h
arch/i386/kernel/entry.S
include/asm-i386/unistd.h
/*
* This file contains the system call numbers.
*/
#define __NR_exit 1
#define __NR_fork 2
#define __NR_read 3
#define __NR_write 4
#define __NR_open 5
. . . . . .
#define __NR_getpid 20
. . . . . .
arch/i386/kernel/entry.S
.data
ENTRY(sys_call_table)
.long sys_restart_syscall /* 0 - old "setup()" system call, used for restarting */
.long sys_exit
.long sys_fork
.long sys_read
.long sys_write
.long sys_open /* 5 */
.long sys_close
.long sys_waitpid
.long sys_creat
.long sys_link
.long sys_unlink /* 10 */
.long sys_execve
.long sys_chdir
.long sys_time
.long sys_mknod
.long sys_chmod /* 15 */
.long sys_lchown16
.long sys_ni_syscall /* old break syscall holder */
......
Making a System Call
Software Interrupt
Historically: int $0x80
Modern: sysenter
System Call Number
Put in %eax register before interrupt
sys_call_table in arch/i386/kernel/entry.S
Parameters
1-5 args: %ebx, %ecx, %edx, %esi, %edi
6+ args: one register has pointer to user space params
Returning
Return from software interrupt: iret or sysexit
Return value stored in %eax register
System Call Macros
include/asm-i386/unistd.h
These macros (_syscall0) use the inline assembly feature of gcc.
#define _syscall0(type,name) \
type name(void) \
{ \
long __res; \
__asm__ volatile ("int $0x80" \
: "=a" (__res) \
: "0" (__NR_##name)); \
__syscall_return(type,__res); \
}
#define _syscall2(type,name,type1,arg1,type2,arg2) \
type name(type1 arg1,type2 arg2) \
{ \
long __res; \
__asm__ volatile ("int $0x80" \
: "=a" (__res) \
: "0" (__NR_##name),"b" ((long)(arg1)),"c" ((long)(arg2))); \
__syscall_return(type,__res); \
}
#define __syscall_return(type, res) \
do { \
if ((unsigned long)(res) >= (unsigned long)(-125)) { \
errno = -(res); \
res = -1; \
} \
return (type) (res); \
} while (0)
System Call Entry
arch/i386/kernel/entry.S
ENTRY(system_call)
pushl %eax # save orig_eax
SAVE_ALL
GET_THREAD_INFO(%ebp)
# system call tracing in operation
testb $(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT),TI_flags(%ebp)
jnz syscall_trace_entry
cmpl $(nr_syscalls), %eax # is eax a correct number ?
jae syscall_badsys
syscall_call:
call *sys_call_table(,%eax,4) #call the service routine
movl %eax,EAX(%esp) # store the return value
syscall_exit:
cli # make sure we don't miss an interrupt
# setting need_resched or sigpending
# between sampling and the iret
movl TI_flags(%ebp), %ecx
testw $_TIF_ALLWORK_MASK, %cx # current->work
jne syscall_exit_work
restore_all:
RESTORE_ALL
#define SAVE_ALL \
cld; \
pushl %es; \
pushl %ds; \
pushl %eax; \
pushl %ebp; \
pushl %edi; \
pushl %esi; \
pushl %edx; \
pushl %ecx; \
pushl %ebx; \
movl $(__USER_DS), %edx; \
movl %edx, %ds; \
movl %edx, %es;
#define RESTORE_INT_REGS \
popl %ebx; \
popl %ecx; \
popl %edx; \
popl %esi; \
popl %edi; \
popl %ebp; \
popl %eax
#define RESTORE_REGS \
RESTORE_INT_REGS; \
1: popl %ds; \
2: popl %es; \
.......
#define RESTORE_ALL \
RESTORE_REGS \
addl $4, %esp; \
1: iret; \
More about System Call
sys_foo, do_foo idiom
all system calls proper begin with sys_
often delegate to do_ function for the real work
asmlinkage
gcc magic to keep parameters on the stack
avoids register optimizations
sys_ni_syscall
just returns -ENOSYS!
fills “holes” for obsolete syscalls or library implemented calls
System call name: getpid()
System call function: sys_getpid()
asmlinkage long sys_getpid(void)
{
return current->tgid;
}
Adding a System Call
Write system call function
Add entry to end of sys_call_table
In arch/i386/kernel/entry.S add
.long sys_mycall
Define system call number for user.
In include/asm-i386/unistd.h
#define __NR_mycall 289
Compile kernel
Calling your new syscall
#include <linux/unistd.h>
#define __NR_current_time 289
_syscall0(long, current_time)
#include <stdio.h>
int main()
{
long retval = 1;
retval = current_time();
printf("The return value is %ld\n", retval);
return 0;
}