The first thing we need to know is how the SYSCALL instruction works. For this, we will refer to the description in the Intel Manual[1]:

   SYSCALL invokes an OS system-call handler at privilege level 0. It does so
   by loading RIP from the IA32_LSTAR MSR (after saving the address of the 
   instruction following SYSCALL into RCX). (The WRMSR instruction ensures
   that the IA32_LSTAR MSR always contain a canonical address.)
   SYSCALL loads the CS and SS selectors with values derived from bits 47:32 
   of the IA32_STAR MSR. However, the CS and SS descriptor caches are not 
   loaded from the descriptors (in GDT or LDT) referenced by those selectors.
   Instead, the descriptor caches are loaded with fixed values. See the 
   Operation section for details. It is the responsibility of OS software to
   ensure that the descriptors in GDT or LDT referenced by those selector 
   values correspond to the fixed values loaded into the descriptor caches;
   the SYSCALL instruction does not ensure this correspondence.
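Condensing the Operation section of the manual, we can sketch what SYSCALL does in pseudo-code (segment-selector details simplified; this is a summary, not the full Operation listing):

```
RCX = RIP                            /* return address */
RIP = IA32_LSTAR
R11 = RFLAGS
RFLAGS = RFLAGS & ~IA32_FMASK
CS.selector = IA32_STAR[47:32]
SS.selector = IA32_STAR[47:32] + 8
CPL = 0
```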

Now we know from which location the address of the syscall entry point is loaded, where the values for our new %cs and %ss come from, and the value used to mask RFLAGS. All of these values come from Model Specific Registers. We are going to use the syscall kld example so that we can extract the values of these registers. We are using the example at '/usr/share/examples/kld/syscall'. The only thing we need to change is the function hello in 'module/syscall.c', adding 3 lines:

   uprintf("IA32_LSTAR :  %lx\n", rdmsr(0xc0000082));
   uprintf("IA32_STAR  :  %lx\n", rdmsr(0xc0000081));
   uprintf("IA32_FMASK :  %lx\n", rdmsr(0xc0000084));

If you are wondering where we get the values passed to rdmsr, the answer is the file '/usr/include/x86/specialreg.h'. Now we just need to compile this kld module, load it, and invoke the syscall it implements using the caller from the example, as follows:

   # make
   # kldload module/syscall.ko
   # test/call
   IA32_LSTAR :  ffffffff80f57dc0
   IA32_STAR  :  33002000000000
   IA32_FMASK :  4701
   # kldunload syscall

We now have all we need to understand what is going on when we issue a SYSCALL, but we also want to know what happens after the execution of the SYSCALL instruction. At this point, we should take a look at the kernel and see if there is some label that marks the address stored in IA32_LSTAR (the new RIP):

   # readelf -a /boot/kernel/kernel | grep ffffffff80f57dc0
   12667: ffffffff80f57dc0     0 FUNC    GLOBAL DEFAULT    5 Xfast_syscall_pti
   52600: ffffffff80f57dc0     0 FUNC    GLOBAL DEFAULT    5 Xfast_syscall_pti

Once we locate the Xfast_syscall_pti label in the kernel, we just need to search the source code for our fast syscall entry point. The file that contains the function that initializes the fast syscall machinery is '/usr/src/sys/amd64/amd64/machdep.c'. This is the relevant code:

   /* Set up the fast syscall stuff */
       uint64_t msr;
       msr = rdmsr(MSR_EFER) | EFER_SCE;
       wrmsr(MSR_EFER, msr);
       wrmsr(MSR_LSTAR, pti ? (u_int64_t)IDTVEC(fast_syscall_pti) :
           (u_int64_t)IDTVEC(fast_syscall));
       wrmsr(MSR_CSTAR, (u_int64_t)IDTVEC(fast_syscall32));
       msr = ((u_int64_t)GSEL(GCODE_SEL, SEL_KPL) << 32) |
           ((u_int64_t)GSEL(GUCODE32_SEL, SEL_UPL) << 48);
       wrmsr(MSR_STAR, msr);
       wrmsr(MSR_SF_MASK, PSL_NT | PSL_T | PSL_I | PSL_C | PSL_D);

Here we can see exactly how the IA32_LSTAR MSR gets its value. All of these macros will be explained later in this write-up.

The next important file to look at is '/usr/src/sys/amd64/amd64/exception.S', which contains the real entry point for fast syscalls. The following comment describes the fast syscall entry point:

    * Fast syscall entry point.  We enter here with just our new %cs/%ss set,
    * and the new privilige level.  We are still running on the old user stack
    * pointer.  We have to juggle a few things around to find our stack etc.
    * swapgs gives us access to our PCPU space only.
    * We do not support invoking this from a custom segment registers,
    * esp. %cs, %ss, %fs, %gs, e.g. using entries from an LDT.

The fast syscall entry point of FreeBSD/amd64 is implemented in '/usr/src/sys/amd64/amd64/exception.S' and starts at line 430.


The SUPERALIGN_TEXT macro is defined at '/usr/include/machine/asmacros.h', line 61:

   #define SUPERALIGN_TEXT .p2align 4,0x90 /* 16-byte alignment, nop filled */

As the comment in the source code says, the entry point gets 16-byte alignment, filled with NOPs. Now we advance to line 431.

   431| IDTVEC(fast_syscall_pti)

The IDTVEC macro is defined at '/usr/include/machine/asmacros.h', lines 165-166:

    * Convenience macro for declaring interrupt entry points.
   #define IDTVEC(name)    ALIGN_TEXT; .globl __CONCAT(X,name); \
            .type __CONCAT(X,name),@function; __CONCAT(X,name):

This function-like macro uses two other macros in its body. The first is ALIGN_TEXT, defined at '/usr/include/machine/asmacros.h', lines 56-60, curiously with the same value as SUPERALIGN_TEXT in both branches of a seemingly useless #ifdef:

   #ifdef GPROF
   #define ALIGN_TEXT  .p2align 4,0x90 /* 16-byte alignment, nop filled */
   #else
   #define ALIGN_TEXT  .p2align 4,0x90 /* 16-byte alignment, nop filled */
   #endif

The second is the __CONCAT function-like macro, defined at '/usr/src/sys/sys/cdefs.h' on lines 156-157; the comment in the source code tells us what it does:

   * The __CONCAT macro is used to concatenate parts of symbol names, e.g.
   * with "#define OLD(foo) __CONCAT(old,foo)", OLD(foo) produces oldfoo.
   * The __CONCAT macro is a bit tricky to use if it must work in non-ANSI 
   * mode -- there must be no spaces between its arguments, and for nested
   * __CONCAT's, all the __CONCAT's must be at the left.  __CONCAT can also
   * concatenate double-quoted strings produced by the __STRING macro, but
   * this only works with ANSI C..... */
   #define __CONCAT1(x,y)  x ## y
   #define __CONCAT(x,y)   __CONCAT1(x,y)

Going back to the IDTVEC invocation, we can see the macro expands to form an interrupt entry point like this:

   .p2align 4,0x90
   .globl Xfast_syscall_pti
   .type Xfast_syscall_pti,@function
   Xfast_syscall_pti:

I noticed the X used in the __CONCAT call inside IDTVEC and thought it was another macro, but I didn't find any definition for X. So I took another look at the kernel and searched for the *fast_syscall_pti symbol:

   $ readelf -a /boot/kernel/kernel | grep fast_syscall_pti
   12667: ffffffff80f57dc0     0 FUNC    GLOBAL DEFAULT    5 Xfast_syscall_pti
   52600: ffffffff80f57dc0     0 FUNC    GLOBAL DEFAULT    5 Xfast_syscall_pti

Yes, it was that simple: X is just a literal token pasted in by the __CONCAT call to form the Xfast_syscall_pti symbol. Now we can advance to line 432.

   432|    swapgs

I had never used or seen the swapgs instruction before, so I checked the Intel Reference Manual[1] to figure out what is happening here. I learned that this instruction is used to access kernel structures (the PCPU space, which is referenced later on). The description is:

   SWAPGS exchanges the current GS base register value with the value 
   contained in MSR address C0000102H (IA32_KERNEL_GS_BASE). The SWAPGS
   instruction is a privileged instruction intended for use by system software.
   By design, SWAPGS does not require any general purpose registers or memory
   operands. No registers need to be saved before using the instruction.
   SWAPGS exchanges the CPL 0 data pointer from the IA32_KERNEL_GS_BASE MSR 
   with the GS base register. The kernel can then use the GS prefix on normal
   memory references to access kernel data structures.

The basic operation of swapgs can be demonstrated with:

   tmp = GS.base;
   GS.base = IA32_KERNEL_GS_BASE;
   IA32_KERNEL_GS_BASE = tmp;

Before we discuss the GS register again, let's look at line 433:

   433|    movq    %rax,PCPU(SCRATCH_RAX)

The PCPU macro is defined at '/usr/include/machine/asmacros.h' on line 157:

   * Access per-CPU data.
   #define PCPU(member)    %gs:PC_ ## member

Now we can see why swapgs is used on line 432: we need to access per-CPU data, and this is done through the GS segment register, whose base was loaded from the IA32_KERNEL_GS_BASE MSR by swapgs. The macro used on line 433 expands to the first such access:

   movq    %rax,%gs:PC_SCRATCH_RAX

As we can see, the PC_SCRATCH_RAX symbol needs to be defined at this point, so let's figure out where it comes from.

   # grep -s -d recurse "PC_SCRATCH_RAX" /sys/*
   /sys/amd64/amd64/genassym.c:ASSYM(PC_SCRATCH_RAX, offsetof(struct pcpu, 

We can see the symbol PC_SCRATCH_RAX is passed to the ASSYM macro in '/sys/amd64/amd64/genassym.c' at line 216. That macro is defined at '/usr/src/sys/sys/assym.h' on lines 37-42:

   #define ASSYM_BIAS      0x10000 /* avoid zero-length arrays */
   #define ASSYM_ABS(value)    ((value) < 0 ? -((value) + 1) + 1ULL : (value))
   #define ASSYM(name, value)                            \
   char name ## sign[((value) < 0 ? 1 : 0) + ASSYM_BIAS];                \
   char name ## w0[(ASSYM_ABS(value) & 0xFFFFU) + ASSYM_BIAS];           \
   char name ## w1[((ASSYM_ABS(value) & 0xFFFF0000UL) >> 16) + ASSYM_BIAS];      \
   char name ## w2[((ASSYM_ABS(value) & 0xFFFF00000000ULL) >> 32) + ASSYM_BIAS]; \
   char name ## w3[((ASSYM_ABS(value) & 0xFFFF000000000000ULL) >> 48) + ASSYM_BIAS]

The ASSYM macro doesn't create/define the symbol itself; what it does is create five char arrays. I ran the C pre-processor on '/sys/amd64/amd64/genassym.c' to look at the expanded macros and got:

   $ cc -E /sys/amd64/amd64/genassym.c
   char PC_SCRATCH_RAXsign[((offsetof(struct pcpu, pc_scratch_rax)) < 0 ? 1 : 
   0) + 0x10000]; 
   char PC_SCRATCH_RAXw0[(((offsetof(struct pcpu, pc_scratch_rax)) < 0 ? -
   ((offsetof(struct pcpu, pc_scratch_rax)) + 1) + 1ULL : 
   (offsetof(struct pcpu, pc_scratch_rax))) & 0xFFFFU) + 0x10000]; 
   char PC_SCRATCH_RAXw1[((((offsetof(struct pcpu, pc_scratch_rax)) < 0 ? 
   -((offsetof(struct pcpu, pc_scratch_rax)) + 1) + 1ULL : 
   (offsetof(struct pcpu, pc_scratch_rax))) & 0xFFFF0000UL) >> 16) + 0x10000]; 
   char PC_SCRATCH_RAXw2[((((offsetof(struct pcpu, pc_scratch_rax)) < 0 ? 
   -((offsetof(struct pcpu, pc_scratch_rax)) + 1) + 1ULL : 
   (offsetof(struct pcpu, pc_scratch_rax))) & 0xFFFF00000000ULL) >> 32) + 0x10000];
   char PC_SCRATCH_RAXw3[((((offsetof(struct pcpu, pc_scratch_rax)) < 0 ? 
   -((offsetof(struct pcpu, pc_scratch_rax)) + 1) + 1ULL : 
   (offsetof(struct pcpu, pc_scratch_rax))) & 0xFFFF000000000000ULL) >> 48) 
   + 0x10000];

None of these arrays has a name like the one used in the macro expansion of line 433. So I read the includes of /usr/src/sys/amd64/amd64/exception.S again and found the following include on line 45:

   #include "assym.s" 

This file is included as a local file, so I needed to figure out where it is created. I did a search on the system Makefiles and got:

   # grep -s -d recurse "assym.s" /usr/src/sys/conf/ 
   /usr/src/sys/conf/ $S/kern/ genassym.o

Now we can take a look at '/usr/src/sys/conf/' and see how 'assym.s' is created, at lines 188-189:

   assym.s: $S/kern/ genassym.o
       NM='${NM}' NMFLAGS='${NMFLAGS}' sh $S/kern/ genassym.o > ${.TARGET}

Now we can see where the PC_SCRATCH_RAX symbol is defined. The '' script receives the 'genassym.o' object and produces the 'assym.s' file. You can take a look at '/usr/src/sys/kern/' to see the functions that do this. The output is just a bunch of defines in the following format:

   printf("#define\t%s\t%s0x%s\n", $3, sign, w)

Now that we know how the ASSYM macros are used and later transformed into assym.s, we can go back and see what the pcpu struct used in the ASSYM macro represents. The pcpu struct is defined at '/usr/src/sys/sys/pcpu.h'; we can reproduce the declaration, using the comments as a reference:

    * This structure maps out the global data that needs to be kept on a
    * per-cpu basis.  The members are accessed via the PCPU_GET/SET/PTR
    * macros defined in <machine/pcpu.h>.  Machine dependent fields are
    * defined in the PCPU_MD_FIELDS macro defined in <machine/pcpu.h>.
   struct pcpu {
           struct thread   *pc_curthread;          /* Current thread */
           struct thread   *pc_idlethread;         /* Idle thread */
           struct thread   *pc_fpcurthread;        /* Fp state owner */
           struct thread   *pc_deadthread;         /* Zombie thread or NULL */
           struct pcb      *pc_curpcb;             /* Current pcb */
           uint64_t        pc_switchtime;          /* cpu_ticks() at last csw */
           int             pc_switchticks;         /* `ticks' at last csw */
           u_int           pc_cpuid;               /* This cpu number */
           STAILQ_ENTRY(pcpu) pc_allcpu;
           struct lock_list_entry *pc_spinlocks;
           struct vmmeter  pc_cnt;                 /* VM stats counters */
           long            pc_cp_time[CPUSTATES];  /* statclock ticks */
           struct device   *pc_device;
           void            *pc_netisr;             /* netisr SWI cookie */
           int             pc_unused1;             /* unused field */
           int             pc_domain;              /* Memory domain. */
           struct rm_queue pc_rm_queue;            /* rmlock list of trackers */
           uintptr_t       pc_dynamic;             /* Dynamic per-cpu data area */
            /*
             * Keep MD fields last, so that CPU-specific variations on a
             * single architecture don't result in offset variations of
             * the machine-independent fields of the pcpu.  Even though
             * the pcpu structure is private to the kernel, some ports
             * (e.g., lsof, part of gtop) define _KERNEL and include this
             * header.  While strictly speaking this is wrong, there's no
             * reason not to keep the offsets of the MI fields constant
             * if only to make kernel debugging easier.
             */
            PCPU_MD_FIELDS;
   } __aligned(CACHE_LINE_SIZE);

As the comments tell us, the machine-dependent fields are defined in '/usr/include/machine/pcpu.h'. I will list the PCPU_MD_FIELDS here, just to make the view of the struct complete:

   #define PCPU_MD_FIELDS                                                  \
       char    pc_monitorbuf[128] __aligned(128); /* cache line */     \
       struct  pcpu *pc_prvspace;      /* Self-reference */            \
       struct  pmap *pc_curpmap;                                       \
       struct  amd64tss *pc_tssp;      /* TSS segment active on CPU */ \
       struct  amd64tss *pc_commontssp;/* Common TSS for the CPU */    \
       uint64_t pc_kcr3;                                               \
       uint64_t pc_ucr3;                                               \
       uint64_t pc_saved_ucr3;                                         \
       register_t pc_rsp0;                                             \
       register_t pc_scratch_rsp;      /* User %rsp in syscall */      \
       register_t pc_scratch_rax;                                      \
       u_int   pc_apic_id;                                             \
       u_int   pc_acpi_id;             /* ACPI CPU id */               \
       /* Pointer to the CPU %fs descriptor */                         \
       struct user_segment_descriptor  *pc_fs32p;                      \
       /* Pointer to the CPU %gs descriptor */                         \
       struct user_segment_descriptor  *pc_gs32p;                      \
       /* Pointer to the CPU LDT descriptor */                         \
       struct system_segment_descriptor *pc_ldt;                       \
       /* Pointer to the CPU TSS descriptor */                         \
       struct system_segment_descriptor *pc_tss;                       \
       uint64_t        pc_pm_save_cnt;                                 \
       u_int   pc_cmci_mask;           /* MCx banks for CMCI */        \
       uint64_t pc_dbreg[16];          /* ddb debugging regs */        \
       uint64_t pc_pti_stack[PC_PTI_STACK_SZ];                         \
       int pc_dbreg_cmd;               /* ddb debugging reg cmd */     \
       u_int   pc_vcpu_id;             /* Xen vCPU ID */               \
       uint32_t pc_pcid_next;                                          \
       uint32_t pc_pcid_gen;                                           \
       uint32_t pc_smp_tlb_done;       /* TLB op acknowledgement */    \
       uint32_t pc_ibpb_set;                                           \
       char    __pad[96]               /* be divisor of PAGE_SIZE      \
                                          after cache alignment */

Finally, we can see exactly what the instruction on line 433 means; in pseudo-code, we can represent it like this:

   pcpu->pc_scratch_rax = %rax

This instruction just saves the value of %rax into PCPU space, so now we can advance to lines 434-435.

   434|    movq    PCPU(KCR3),%rax
   435|    movq    %rax,%cr3

On these two lines, %rax is loaded with the value of KCR3 that was stored in PCPU space, and the %cr3 register is loaded with the value from %rax. We can represent this transfer with the following pseudo-code:

   %cr3 = pcpu->pc_kcr3

This is done to update the physical base address of the page directory[2], which is stored in the control register %cr3. To understand this better, you can take a look at how paging works[3]. Now we can go to line 436.

   436|    jmp fast_syscall_common

Here we have a jump to the label on line 441, which is a body common to both entry points declared with IDTVEC. Before we go there, let's take a look at the instructions in between, on lines 437-440.

   438| IDTVEC(fast_syscall)
   439|    swapgs  
   440|     movq    %rax,PCPU(SCRATCH_RAX)

The code from line 437 to line 440 does almost the same thing as the code from line 430 to line 433; the only difference is the parameter of the function-like macro IDTVEC. This time IDTVEC receives 'fast_syscall' and expands to the following function label:

   .p2align 4,0x90
   .globl Xfast_syscall
   .type Xfast_syscall,@function
   Xfast_syscall:

   441| fast_syscall_common:

Line 441 has the target label used by the jump instruction on line 436. Now that we are here, let's take a look at the next instructions.

   442|    movq    %rsp,PCPU(SCRATCH_RSP)
   443|    movq    PCPU(RSP0),%rsp

Line 442 saves the user's %rsp, while line 443 loads %rsp with PCPU(RSP0). This replaces the user's %rsp with the kernel stack pointer stored in PCPU space, which can be represented with the following pseudo-code:

   pcpu->pc_scratch_rsp = %rsp
   %rsp = pcpu->pc_rsp0

Now we can advance a little bit in the code.

   444|    /* Now emulate a trapframe. Make the 8 byte alignment odd for call. */
   445|    subq    $TF_SIZE,%rsp

Before showing what the code above does, we need to know what TF_SIZE means and how this symbol is created. I took a look at '/usr/src/sys/amd64/amd64/genassym.c' again: all the TF_* symbols are defined/created like the PC_* symbols explained earlier for PC_SCRATCH_RAX. The line that defines TF_SIZE is the following:

   ASSYM(TF_SIZE, sizeof(struct trapframe));

Now that we know TF_SIZE refers to the size of struct trapframe, we know how many bytes are being allocated on line 445 to 'create' the trap frame. The struct trapframe is defined at /usr/src/sys/x86/include/frame.h as:

    * Exception/Trap Stack Frame
    * The ordering of this is specifically so that we can take the first 6
    * syscall arguments directly from the beginning of the frame.
   struct trapframe {
           register_t      tf_rdi;
           register_t      tf_rsi;
           register_t      tf_rdx;
           register_t      tf_rcx;
           register_t      tf_r8;
           register_t      tf_r9;
           register_t      tf_rax;
           register_t      tf_rbx;
           register_t      tf_rbp;
           register_t      tf_r10;
           register_t      tf_r11;
           register_t      tf_r12;
           register_t      tf_r13;
           register_t      tf_r14;
           register_t      tf_r15;
           uint32_t        tf_trapno;
           uint16_t        tf_fs;
           uint16_t        tf_gs;
           register_t      tf_addr;
           uint32_t        tf_flags;
           uint16_t        tf_es;
           uint16_t        tf_ds;
           /* below portion defined in hardware */
           register_t      tf_err;
           register_t      tf_rip;
           register_t      tf_cs;
           register_t      tf_rflags;
           /* the amd64 frame always has the stack registers */
           register_t      tf_rsp;
           register_t      tf_ss;
   };

At this point we have the stack with a struct trapframe allocated. Let's take a look at the next few lines of code.

   446|     /* defer TF_RSP till we have a spare register */
   447|     movq    %r11,TF_RFLAGS(%rsp)
   448|     movq    %rcx,TF_RIP(%rsp)   /* %rcx original value is in %r10 */
   449|     movq    PCPU(SCRATCH_RSP),%r11  /* %r11 already saved */
   450|     movq    %r11,TF_RSP(%rsp)   /* user stack pointer */
   451|     movq    PCPU(SCRATCH_RAX),%rax 
   452|     movq    %rax,TF_RAX(%rsp)   /* syscall number */
   453|     movq    %rdx,TF_RDX(%rsp)   /* arg 3 */

All the TF_[REGNAME] symbols are defined as offsets from the base of a trapframe structure. You can take a look at '/usr/src/sys/amd64/amd64/genassym.c' again to see how this is done. If we treat the trapframe structure on the stack as a variable named trapframe, we can translate the code from lines 447 to 453 into the following pseudo-code:

   trapframe->tf_rflags = %r11
   trapframe->tf_rip = %rcx
   %r11 = pcpu->pc_scratch_rsp
   trapframe->tf_rsp = %r11
   %rax = pcpu->pc_scratch_rax
   trapframe->tf_rax = %rax
   trapframe->tf_rdx = %rdx

As you can probably see, we have started to fill our stack with the arguments for the system call. Now we can advance a bit.

   454|     SAVE_SEGS

The SAVE_SEGS assembly macro is defined at '/usr/include/machine/asmacros.h' on lines 168-173:

   .macro  SAVE_SEGS
   movw    %fs,TF_FS(%rsp)
   movw    %gs,TF_GS(%rsp)
   movw    %es,TF_ES(%rsp)
   movw    %ds,TF_DS(%rsp)
   .endm

The SAVE_SEGS macro saves the segment registers into the trapframe structure previously allocated on the stack. We can advance to line 455 now.

   455|     call    handle_ibrs_entry

The handle_ibrs_entry is defined at '/usr/src/sys/amd64/amd64/support.S' as:

   /* all callers already saved %rax, %rdx, and %rcx */
   ENTRY(handle_ibrs_entry)
           cmpb    $0,hw_ibrs_active(%rip)
           je      1f
           movl    $MSR_IA32_SPEC_CTRL,%ecx
           rdmsr
           orl     $(IA32_SPEC_CTRL_IBRS|IA32_SPEC_CTRL_STIBP),%eax
           orl     $(IA32_SPEC_CTRL_IBRS|IA32_SPEC_CTRL_STIBP)>>32,%edx
           wrmsr
           movb    $1,PCPU(IBPB_SET)
           testl   $CPUID_STDEXT_SMEP,cpu_stdext_feature(%rip)
           jne     1f
           ibrs_seq 32
   1:      ret

Before we dig into this function, we need to figure out what IBRS is. After some minimal research, I learned that IBRS stands for Indirect Branch Restricted Speculation. You can read more about it in Intel's paper on Speculative Execution Side Channels[4]. We can summarize it as a Branch Target Injection mitigation; see section 3.2 of the paper for a more in-depth look. The following is handle_ibrs_entry after macro expansion in support.S:

   $ cc -E /usr/src/sys/amd64/amd64/support.S
   .p2align 4,0x90; .globl handle_ibrs_entry; .type handle_ibrs_entry,@function; handle_ibrs_entry:
    cmpb $0,hw_ibrs_active(%rip)
    je 1f
    movl $0x048,%ecx
    rdmsr
    orl $(0x00000001|0x00000002),%eax
    orl $(0x00000001|0x00000002)>>32,%edx
    wrmsr
    movb $1,PCPU(IBPB_SET)
    testl $0x00000080,cpu_stdext_feature(%rip)
    jne 1f
    ibrs_seq 32
   1: ret
   .size handle_ibrs_entry, . - handle_ibrs_entry

Here we have operations with model specific registers (MSRs) again: the function called on line 455 tests for IBRS capability and mitigates the kind of attacks described in the Intel paper referenced above. As we can see, a lot of instructions are executed on every syscall. This overhead is why a lot of people still warn that Spectre/Meltdown mitigations hurt performance; note also the IBPB flag being set in PCPU space, as discussed in a Red Hat post about Spectre/Meltdown performance and mitigations[5]. If you want to learn more about mitigations for speculative execution, check out the resources at the end of this write-up. Now we can proceed with the code.

   456|     movq    PCPU(CURPCB),%r11
   457|     andl    $~PCB_FULL_IRET,PCB_FLAGS(%r11)

Before we can deal with this code directly, we need to figure out what a PCB is in this context. The PCB is the Process Control Block, a structure the operating system keeps for each of its active processes to hold resource management information, administrative information, and an execution snapshot[6]. On our system, this structure is defined in '/usr/src/sys/amd64/include/pcb.h' and contains:

    * NB: The fields marked with (*) are used by kernel debuggers.  Their
    * ABI should be preserved.
   struct pcb {
           register_t      pcb_r15;        /* (*) */
           register_t      pcb_r14;        /* (*) */
           register_t      pcb_r13;        /* (*) */
           register_t      pcb_r12;        /* (*) */
           register_t      pcb_rbp;        /* (*) */
           register_t      pcb_rsp;        /* (*) */
           register_t      pcb_rbx;        /* (*) */
           register_t      pcb_rip;        /* (*) */
           register_t      pcb_fsbase;
           register_t      pcb_gsbase;
           register_t      pcb_kgsbase;
           register_t      pcb_cr0;
           register_t      pcb_cr2;
           register_t      pcb_cr3;
           register_t      pcb_cr4;
           register_t      pcb_dr0;
           register_t      pcb_dr1;
           register_t      pcb_dr2;
           register_t      pcb_dr3;
           register_t      pcb_dr6;
           register_t      pcb_dr7;
           struct region_descriptor pcb_gdt;
           struct region_descriptor pcb_idt;
           struct region_descriptor pcb_ldt;
           uint16_t        pcb_tr;
           u_int           pcb_flags;
   #define PCB_FULL_IRET   0x01    /* full iret is required */
   #define PCB_DBREGS      0x02    /* process using debug registers */
   #define PCB_KERNFPU     0x04    /* kernel uses fpu */
   #define PCB_FPUINITDONE 0x08    /* fpu state is initialized */
   #define PCB_USERFPUINITDONE 0x10 /* fpu user state is initialized */
   #define PCB_32BIT       0x40    /* process has 32 bit context (segs etc) */
   #define PCB_FPUNOSAVE   0x80    /* no save area for current FPU ctx */
           uint16_t        pcb_initial_fpucw;
           /* copyin/out fault recovery */
           caddr_t         pcb_onfault;
           uint64_t        pcb_saved_ucr3;
           /* local tss, with i/o bitmap; NULL for common */
           struct amd64tss *pcb_tssp;
           /* model specific registers */
           register_t      pcb_efer;
           register_t      pcb_star;
           register_t      pcb_lstar;
           register_t      pcb_cstar;
           register_t      pcb_sfmask;
           struct savefpu  *pcb_save;
           uint64_t        pcb_pad[5];
   };

All PCB_* symbols are created in the same way as PC_* and TF_*. Now that we know what this struct is, we can figure out what the code on lines 456 and 457 is doing. We can represent it with the following pseudo-code:

   %r11 = pcpu->pc_curpcb  
   pcb->pcb_flags = (NOT(PCB_FULL_IRET)) AND pcb->pcb_flags

The code above just loads a pointer to the current PCB into register %r11 and clears the PCB_FULL_IRET bit (defined at '/usr/src/sys/amd64/include/pcb.h') in the PCB flags. This turns off the full context restore on return. Now we can advance to line 458.

   458|     sti

The sti instruction sets bit 9 of the RFLAGS register, the interrupt flag (IF). The IF flag controls the servicing of hardware-generated interrupts (those received at the processor's INTR pin): if IF is set, the processor handles whatever hardware interrupt is triggered; if it is clear, hardware interrupts are masked[1]. Note that PSL_I (the IF bit) is part of the mask written to MSR_SF_MASK earlier, so SYSCALL disabled interrupts on entry; they are re-enabled here, now that the kernel stack is set up. Let's take a look at the next two instructions now.

   459|     movq    $KUDSEL,TF_SS(%rsp)
   460|     movq    $KUCSEL,TF_CS(%rsp)

Both KUDSEL and KUCSEL are defined at '/sys/amd64/amd64/genassym.c', but unlike the symbols we saw before, they are not defined with direct values. They are defined as:

   ASSYM(KUCSEL, GSEL(GUCODE_SEL, SEL_UPL));
   ASSYM(KUDSEL, GSEL(GUDATA_SEL, SEL_UPL));

The GSEL function-like macro and arguments are all defined at '/usr/src/sys/x86/include/segments.h' as:

    * Selectors
   #define SEL_UPL         3               /* user priority level */
   #define GSEL(s,r)       (((s)<<3) | r)  /* a global selector */
    * Entries in the Global Descriptor Table (GDT)
   #define GUDATA_SEL      7       /* User 32/64 bit Data Descriptor */
   #define GUCODE_SEL      8       /* User 64 bit Code Descriptor */

Now we can understand what is happening on lines 459 and 460: we are loading the stack-selector and code-selector members of the trapframe structure with GDT[7] user entries at user privilege level. Now we can advance further in the code.

   461|     movq    $2,TF_ERR(%rsp)
   462|     movq    %rdi,TF_RDI(%rsp)   /* arg 1 */
   463|     movq    %rsi,TF_RSI(%rsp)   /* arg 2 */
   464|     movq    %r10,TF_RCX(%rsp)   /* arg 4 */
   465|     movq    %r8,TF_R8(%rsp)     /* arg 5 */
   466|     movq    %r9,TF_R9(%rsp)     /* arg 6 */
   467|     movq    %rbx,TF_RBX(%rsp)   /* C preserved */
   468|     movq    %rbp,TF_RBP(%rsp)   /* C preserved */
   469|     movq    %r12,TF_R12(%rsp)   /* C preserved */
   470|     movq    %r13,TF_R13(%rsp)   /* C preserved */
   471|     movq    %r14,TF_R14(%rsp)   /* C preserved */
   472|     movq    %r15,TF_R15(%rsp)   /* C preserved */
   473|     movl    $TF_HASSEGS,TF_FLAGS(%rsp)

As we can see, from line 461 to 473 we are just filling our trapframe with the remaining syscall parameters and saving the registers that the C ABI requires to be preserved across calls. How do we know where each parameter goes and which registers must be preserved? The answer is the System V ABI[8]. TF_HASSEGS, stored by the last line, is defined at '/usr/src/sys/x86/include/frame.h' as:

   #define TF_HASSEGS      0x1

Now that we have filled the trapframe, and set the flag to show everything is where it needs to be, we can look at the next line of code.

   474|     FAKE_MCOUNT(TF_RIP(%rsp))

The FAKE_MCOUNT function-like macro is defined at '/usr/src/sys/amd64/include/asmacros.h' as:

   #define FAKE_MCOUNT(caller)     pushq caller ; call __mcount ; popq %rcx  

So on line 474 we push the return address previously saved in the trapframe onto the top of the stack, call the __mcount function, and pop the return address into the %rcx register. The __mcount function is defined at '/usr/src/sys/amd64/amd64/prof_machdep.c' as:

   __asm("                                                         \n\
   GM_STATE        =       0                                       \n\
   GMON_PROF_OFF   =       3                                       \n\
           .text                                                   \n\
           .p2align 4,0x90                                         \n\
           .globl  __mcount                                        \n\
           .type   __mcount,@function                              \n\
   __mcount:                                                       \n\
           #                                                       \n\
           # Check that we are profiling.  Do it early for speed.  \n\
           #                                                       \n\
           cmpl    $GMON_PROF_OFF,_gmonparam+GM_STATE              \n\
           je      .mcount_exit                                    \n\
           #                                                       \n\
           # __mcount is the same as [.]mcount except the caller   \n\
           # hasn't changed the stack except to call here, so the  \n\
           # caller's raddr is above our raddr.                    \n\

FAKE_MCOUNT is a macro to handle the case of the code being profiled[9]. If we assume that profiling is OFF, the jump to the .mcount_exit label is taken, and we reach the end of the __asm directive.

   .mcount_exit:                                                   \n\
           ret     $0                                              \n\

If we are not profiling, the only thing FAKE_MCOUNT does is load the %rcx register with the value of the tf_rip member of the trapframe. Now we can move on to the function that deals with system calls on amd64.

   475|     movq    PCPU(CURTHREAD),%rdi
   476|     movq    %rsp,TD_FRAME(%rdi)
   477|     movl    TF_RFLAGS(%rsp),%esi
   478|     andl    $PSL_T,%esi
   479|     call    amd64_syscall

The amd64_syscall function is defined at /usr/src/sys/amd64/amd64/trap.c with the following signature:

   void amd64_syscall(struct thread *td, int traced)

We can see here how the parameters for the function are constructed, but we don't know how a thread struct is defined. We can find its definition in '/usr/src/sys/sys/proc.h':

   struct thread {
           struct mtx      *volatile td_lock; /* replaces sched lock */
           struct proc     *td_proc;       /* (*) Associated process. */
           TAILQ_ENTRY(thread) td_plist;   /* (*) All threads in this proc. */
           TAILQ_ENTRY(thread) td_runq;    /* (t) Run queue. */
           TAILQ_ENTRY(thread) td_slpq;    /* (t) Sleep queue. */
           struct trapframe *td_frame;     /* (k) */

The TD_* symbols are defined in the same way as PC_* and TF_*, but they are offsets of members of the thread structure. The only thing left is to figure out what PSL is and where PSL_T is defined. We have now reached the file '/usr/src/sys/x86/include/psl.h', which defines PSL_T and gives an explanation of the processor status longword (PSL):

   /* 386 processor status longword. */
   #define PSL_C           0x00000001      /* carry bit */
   #define PSL_PF          0x00000004      /* parity bit */
   #define PSL_AF          0x00000010      /* bcd carry bit */
   #define PSL_Z           0x00000040      /* zero bit */
   #define PSL_N           0x00000080      /* negative bit */
   #define PSL_T           0x00000100      /* trace enable bit */
   #define PSL_I           0x00000200      /* interrupt enable bit */
   #define PSL_D           0x00000400      /* string instruction direction bit */
   #define PSL_V           0x00000800      /* overflow bit */
   #define PSL_IOPL        0x00003000      /* i/o privilege level */
   #define PSL_NT          0x00004000      /* nested task bit */
   #define PSL_RF          0x00010000      /* resume flag bit */
   #define PSL_VM          0x00020000      /* virtual 8086 mode bit */
   #define PSL_AC          0x00040000      /* alignment checking */
   #define PSL_VIF         0x00080000      /* virtual interrupt enable */
   #define PSL_VIP         0x00100000      /* virtual interrupt pending */
   #define PSL_ID          0x00200000      /* identification bit */

Now we have all we need to understand the function call. If we expand all the macros we will get:

   movq    gs:PC_CURTHREAD,%rdi    ;   %rdi = pcpu->pc_curthread;
   movq    %rsp,TD_FRAME(%rdi)     ;   pcpu->pc_curthread->td_frame = trapframe
   movl    TF_RFLAGS(%rsp),%esi    ;   %esi = trapframe->tf_rflags
   andl    $PSL_T,%esi             ;   %esi = %esi & 0x00000100
   call    amd64_syscall           ;   amd64_syscall(pcpu->pc_curthread, trapframe->tf_rflags & 0x00000100);

As we can see, the amd64_syscall function is called with the first argument being a pointer to the current thread stored in PCPU space, and the second being a bitwise AND (&) between the flags stored in the trapframe on the stack and the value 0x00000100, which is the trace enable bit (PSL_T). Now we should take a look at the amd64_syscall implementation:

       int error;
       ksiginfo_t ksi;
   #ifdef DIAGNOSTIC
       if (!TRAPF_USERMODE(td->td_frame)) {
               panic("syscall");
               /* NOT REACHED */
       }
   #endif
       error = syscallenter(td);

The TRAPF_USERMODE function-like macro is defined at '/usr/src/sys/amd64/include/cpu.h' as

   #define TRAPF_USERMODE(framep) \
           (ISPL((framep)->tf_cs) == SEL_UPL)

The macros used inside TRAPF_USERMODE are defined in the file '/usr/src/sys/x86/include/segments.h'. The macro does just what its name describes: it tests whether the trapframe has the user privilege level. If not, we make a call to panic(9). We won't get into the details of the panic function, but you can take a look at the panic man page or study its implementation, starting with its definition at '/usr/src/sys/kern/kern_shutdown.c'.

If DIAGNOSTIC is not defined, or if the panic function is not triggered, we move on to the syscallenter function, passing it the current thread stored in PCPU space. The syscallenter function is defined in '/usr/src/sys/kern/subr_syscall.c'. Let's look at the syscallenter code:

   struct proc *p;             //Process struct pointer
   struct syscall_args *sa;    //Syscall arguments struct pointer 
   int error, traced;
   p = td->td_proc;            
   sa = &td->td_sa;

Both the proc and thread structs are defined at '/usr/src/sys/sys/proc.h', the syscall_args struct is defined at '/usr/src/sys/amd64/include/proc.h', and PCPU_INC is defined at '/usr/src/sys/amd64/include/pcpu.h'. We can continue with the syscallenter code:

   td->td_pticks = 0;                               /* (t) Statclock hits for profiling */
   if (td->td_cowgen != p->p_cowgen)                /* study on copy on write */
   traced = (p->p_flag & P_TRACED) != 0;            /* debug purposes */
   if (traced || td->td_dbgflags & TDB_USERWR) {
           td->td_dbgflags &= ~TDB_USERWR;
           if (traced)
                   td->td_dbgflags |= TDB_SCE;
           PROC_UNLOCK(p);
   }
   error = (p->p_sysent->sv_fetch_syscall_args)(td); /* fetching syscall arguments */

To figure out how these arguments are fetched, we need to look at a few structs and how they are filled, starting with struct proc:

   struct proc {
       struct sysentvec *p_sysent;     /* (b) Syscall dispatch info. */

Now we should take a look at struct sysentvec, which is defined at the file '/usr/src/sys/sys/sysent.h'

   struct sysentvec {
       int             (*sv_fetch_syscall_args)(struct thread *);

We can see what kind of function we are calling, but we need to know how the sysentvec structure is filled to figure out which functions are really being called. The definitions used to fill our structure are listed in the file '/usr/src/sys/amd64/amd64/elf_machdep.c' as:

   struct sysentvec elf64_freebsd_sysvec = {
           .sv_size        = SYS_MAXSYSCALL,
           .sv_table       = sysent,
           .sv_mask        = 0,
           .sv_errsize     = 0,
           .sv_errtbl      = NULL,
           .sv_transtrap   = NULL,
           .sv_fixup       = __elfN(freebsd_fixup),
           .sv_sendsig     = sendsig,
           .sv_sigcode     = sigcode,
           .sv_szsigcode   = &szsigcode,
           .sv_name        = "FreeBSD ELF64",
           .sv_coredump    = __elfN(coredump),
           .sv_imgact_try  = NULL,
           .sv_minsigstksz = MINSIGSTKSZ,
           .sv_pagesize    = PAGE_SIZE,
           .sv_minuser     = VM_MIN_ADDRESS,
           .sv_maxuser     = VM_MAXUSER_ADDRESS,
           .sv_usrstack    = USRSTACK,
           .sv_psstrings   = PS_STRINGS,
           .sv_stackprot   = VM_PROT_ALL,
           .sv_copyout_strings     = exec_copyout_strings,
           .sv_setregs     = exec_setregs,
           .sv_fixlimit    = NULL,
           .sv_maxssiz     = NULL,
           .sv_flags       = SV_ABI_FREEBSD | SV_LP64 | SV_SHP | SV_TIMEKEEP,
           .sv_set_syscall_retval = cpu_set_syscall_retval,
           .sv_fetch_syscall_args = cpu_fetch_syscall_args,
           .sv_syscallnames = syscallnames,
           .sv_shared_page_base = SHAREDPAGE,
           .sv_shared_page_len = PAGE_SIZE,
           .sv_schedtail   = NULL,
           .sv_thread_detach = NULL,
           .sv_trap        = NULL,
   };

   INIT_SYSENTVEC(elf64_sysvec, &elf64_freebsd_sysvec);

So at this point we know we are calling cpu_fetch_syscall_args, passing the thread structure as a parameter again. This function is defined in the file '/usr/src/sys/amd64/amd64/trap.c'. Here we present the main parts of cpu_fetch_syscall_args:

   cpu_fetch_syscall_args(struct thread *td)
       struct trapframe *frame;
       register_t *argp;
       struct syscall_args *sa;
       int reg, regcnt, error;
       /* Set the number of registers to copy into our syscall_args struct */
       regcnt = 6;
       /* Load a pointer to our trapframe */
       frame = td->td_frame;
       /* Make sa point to our thread's syscall_args structure */
       sa = &td->td_sa;
       /* Load the syscall number from our trapframe into our syscall_args */
       sa->code = frame->tf_rax;
       /* Test that the syscall code does not index past the end of our
          syscall table */
       if (sa->code >= p->p_sysent->sv_size)
               /* If the code is invalid, point to the entry for SYS_syscall */
               sa->callp = &p->p_sysent->sv_table[0];
       else
               /* Load callp with the sysent indexed by sa->code */
               sa->callp = &p->p_sysent->sv_table[sa->code];
       /* Load the number of arguments into our syscall_args struct */
       sa->narg = sa->callp->sy_narg;
       /* Make argp point to the beginning of our trapframe struct */
       argp = &frame->tf_rdi;
       /* Copy our parameters saved on the trapframe to syscall_args->args */
       bcopy(argp, sa->args, sizeof(sa->args[0]) * regcnt);

There is more to this function, but I highlighted the normal path to load the system call number, number of arguments, and 6 parameters from registers saved on the trapframe following the ABI sequence. Now we should take a look at sysent to better understand the code above. First, we will take a look at the sysent struct, which is defined at '/usr/src/sys/sys/sysent.h' as:

   struct sysent {                 /* system call table */
           int     sy_narg;        /* number of arguments */
           sy_call_t *sy_call;     /* implementing function */
           au_event_t sy_auevent;  /* audit event associated with syscall */
           systrace_args_func_t sy_systrace_args_func;
                                   /* optional argument conversion function. */
           u_int32_t sy_entry;     /* DTrace entry ID for systrace. */
           u_int32_t sy_return;    /* DTrace return ID for systrace. */
           u_int32_t sy_flags;     /* General flags for system calls. */
           u_int32_t sy_thrcnt;
   };

The sysent struct table is filled in the file '/usr/src/sys/kern/init_sysent.c' with entries like:

   /* The casts are bogus but will do for now. */
   struct sysent sysent[] = {
       { 0, (sy_call_t *)nosys, AUE_NULL, NULL, 0, 0, 0, SY_THR_STATIC },
       { AS(sys_exit_args), (sy_call_t *)sys_sys_exit, AUE_EXIT, NULL, 0, 0,
           0, SY_THR_STATIC },   /* 1 = exit */
       { 0, (sy_call_t *)sys_fork, AUE_FORK, NULL, 0, 0, SYF_CAPENABLED,
           SY_THR_STATIC },      /* 2 = fork */

Now we should take a look on struct syscall_args again:

   struct syscall_args {
       u_int code;
       struct sysent *callp;
       register_t args[8];
       int narg;
   };

We have learned that our syscall_args struct has a pointer to a struct sysent named 'callp', and that struct sysent has a member which is a function pointer to the function implementing the syscall being issued. The way this pointer is filled is described above in the walkthrough of cpu_fetch_syscall_args. Let's go back to the code of the syscallenter function. As we can see, it contains a lot of macros asserting for kernel trace. We are not going into kernel tracing here, but you can take a look at the ktr(4), ktr(9), ktrace(1), and ktrace(2) man pages and at the file '/usr/src/sys/sys/ktr.h' to learn more. We are going to reproduce the function after preprocessing to see what is actually being executed:

   $ cc -E /usr/src/sys/kern/subr_syscall.c
   syscallenter(struct thread *td)
       error = (p->p_sysent->sv_fetch_syscall_args)(td);
        /* We are returning here after fetching the parameters */
       if (error == 0) {
        /*  Here we stop our process and generate an event which
         *  can be checked in the kernel log buffers; if you want to get more
         *  in depth you should take a look at /usr/src/sys/sys/proc.h to
         *  understand the macros and at /usr/src/sys/kern/sys_process.c to
         *  figure out how the stopevent function does this.
         */
        STOPEVENT(p, S_SCE, sa->narg);
           if (p->p_flag & P_TRACED) {
               if (p->p_ptevents & PTRACE_SCE)
                   ptracestop((td), 5, ((void *)0));
           /* Fetching the parameters again if debugger changed something */
           if (td->td_dbgflags & TDB_USERWR) {
               error = (p->p_sysent->sv_fetch_syscall_args)(td);
               if (error != 0)
               goto retval;
           /* function defined at /usr/src/sys/kern/kern_syscalls.c */ 
           error = syscall_thread_enter(td, sa->callp);
           if (error != 0)
           goto retval;
            /*  As we know, callp is our sysent struct and the sy_call member is
             *  the implementing function, so here we are finally calling the
             *  function which implements the system call issued.
             */
            error = (sa->callp->sy_call)(td, sa->args);
           if ((td->td_pflags & TDP_NERRNO) == 0)
           td->td_errno = error;
           /* function defined at /usr/src/sys/kern/kern_syscalls.c */
           syscall_thread_exit(td, sa->callp);
       /* Debug purpose */
       if (traced) {
           td->td_dbgflags &= ~TDB_SCE;
        /* As we can see in our sysentvec being filled at
         * /usr/src/sys/amd64/amd64/elf_machdep.c:
         *  .sv_set_syscall_retval = cpu_set_syscall_retval
         * the real function being called here is cpu_set_syscall_retval, and
         * we will show this function now.
         */
        (p->p_sysent->sv_set_syscall_retval)(td, error);
       return (error);

The function cpu_set_syscall_retval is defined at '/usr/src/sys/amd64/amd64/vm_machdep.c'; the code is printed here for reference:

   cpu_set_syscall_retval(struct thread *td, int error)
           switch (error) {
           /* In the normal path, we just copy the return values of the system
              call to our trapframe */
           case 0:
                   td->td_frame->tf_rax = td->td_retval[0];
                   td->td_frame->tf_rdx = td->td_retval[1];
                   td->td_frame->tf_rflags &= ~PSL_C;
                   break;
           case ERESTART:
                   /*
                    * Reconstruct pc, we know that 'syscall' is 2 bytes,
                    * lcall $X,y is 7 bytes, and int 0x80 is 2 bytes.
                    * We saved this in tf_err.
                    * %r10 (which was holding the value of %rcx) is restored
                    * for the next iteration.
                    * The %r10 restore is only required for freebsd/amd64
                    * processes, but shall be innocent for any ia32 ABI.
                    * Require a full context restore to get the arguments
                    * in the registers reloaded at return to usermode.
                    */
                   td->td_frame->tf_rip -= td->td_frame->tf_err;
                   td->td_frame->tf_r10 = td->td_frame->tf_rcx;
                   set_pcb_flags(td->td_pcb, PCB_FULL_IRET);
                   break;
           case EJUSTRETURN:
                   break;
           default:
                   td->td_frame->tf_rax = SV_ABI_ERRNO(td->td_proc, error);
                   td->td_frame->tf_rflags |= PSL_C;

Now our syscall has been executed, our return values have been stored in our trapframe, and we have reached the end of the syscallenter function. Now we continue through the amd64_syscall code; we expand the macros to avoid reading noise from asserts, while preserving some macros which are useful for understanding the code.

   $ cc -E /usr/src/sys/amd64/amd64/trap.c
   amd64_syscall(struct thread *td, int traced)
       int error;
       /* Defined at /usr/src/sys/sys/signalvar.h */
       ksiginfo_t ksi;
       error = syscallenter(td);
       /* We are returning here after reach the end of syscallenter function */
        /* This macro expands to a __builtin_expect used by compilers[10] to do
         * a kind of branch prediction; in this case it predicts that we are
         * not tracing the syscall most of the time.
         */
        if (__predict_false(traced)) {
           td->td_frame->tf_rflags &= ~PSL_T;
           ksi.ksi_signo = SIGTRAP;
           ksi.ksi_code = TRAP_TRACE;
           ksi.ksi_addr = (void *)td->td_frame->tf_rip;
           /* Defined at /usr/src/sys/kern/kern_sig.c */
           trapsignal(td, &ksi);
       /* Kernel asserts expanding to funny code */
       do { } while (0);
       do { } while (0);
       do { } while (0);
        /* The rest of the function is self-explanatory, and we will take a
         * proper look at the syscallret function in the following text.
         */
        syscallret(td, error);
        /*
         * If the user-supplied value of %rip is not a canonical
         * address, then some CPUs will trigger a ring 0 #GP during
         * the sysret instruction.  However, the fault handler would
         * execute in ring 0 with the user's %gs and %rsp which would
         * not be safe.  Instead, use the full return path which
         * catches the problem safely.
         */
        if (__predict_false(td->td_frame->tf_rip >= VM_MAXUSER_ADDRESS))
                set_pcb_flags(td->td_pcb, PCB_FULL_IRET);

The syscallret function is defined at '/usr/src/sys/kern/subr_syscall.c' and is presented here:

   syscallret(struct thread *td, int error)
           struct proc *p, *p2;
           struct syscall_args *sa;
           ksiginfo_t ksi;
           int traced, error1;
           KASSERT((td->td_pflags & TDP_FORKING) == 0,
               ("fork() did not clear TDP_FORKING upon completion"));
           /*
            * Handle reschedule and other end-of-syscall issues
            */
           userret(td, td->td_frame);

I'm only showing the main part of the syscallret function. Here, we call the userret function to handle some end-of-syscall issues. The userret function is self-documented and is defined at '/usr/src/sys/kern/subr_trap.c'. Inside the userret function, we have a call to sched_userret, which is defined at '/usr/src/sys/kern/sched_4bsd.c' and '/usr/src/sys/kern/sched_ule.c' (one per scheduler). Let's return to the amd64_syscall code.

   amd64_syscall(struct thread *td, int traced)
           syscallret(td, error);
           /* We are returning here after reaching the end of the syscallret
            * function, and now we are finally reaching the end of the
            * amd64_syscall function; the following lines just handle the
            * CVE-2012-0217[11] issue.
            *
            * If the user-supplied value of %rip is not a canonical
            * address, then some CPUs will trigger a ring 0 #GP during
            * the sysret instruction.  However, the fault handler would
            * execute in ring 0 with the user's %gs and %rsp which would
            * not be safe.  Instead, use the full return path which
            * catches the problem safely.
            */
           if (__predict_false(td->td_frame->tf_rip >= VM_MAXUSER_ADDRESS))
                   set_pcb_flags(td->td_pcb, PCB_FULL_IRET);

Now we have reached the end of the function amd64_syscall and we are going back to '/usr/src/sys/amd64/amd64/exception.S', right after the call:

   480| 1:  movq    PCPU(CURPCB),%rax
   481|     /* Disable interrupts before testing PCB_FULL_IRET. */
   482|     cli
   483|     testl   $PCB_FULL_IRET,PCB_FLAGS(%rax)
   484|     jnz 4f

The lines above load a pointer to the current PCB into the %rax register, clear the interrupt flag, and test whether we need to do a full context restore instead of using SYSRET; this was explained before in the amd64_syscall function[11]. We assume here that we don't need a full context restore and proceed with the code.

   485|     /* Check for and handle AST's on return to userland. */
   486|     movq    PCPU(CURTHREAD),%rax
   487|     testl   $TDF_ASTPENDING | TDF_NEEDRESCHED,TD_FLAGS(%rax)
   488|     jne 3f
   489|     call    handle_ibrs_exit

Here we are loading a pointer to the current thread into the %rax register and testing whether we need to handle an AST. I had never seen this abbreviation before, so I took a look at the function called when the branch is taken. The function ast() is defined at '/usr/src/sys/kern/subr_trap.c'. We can figure out what it does just by reading the comment before the function:

   "Process an asynchronous software trap"

We assume here that the branch is not taken, and finally we have a call to handle_ibrs_exit, which is defined at '/usr/src/sys/amd64/amd64/support.S'. This is done to avoid the problems addressed in the Intel paper[4]. Now we proceed to the final part of the code that handles the SYSCALL.

   490|     /* Restore preserved registers. */
   491|     MEXITCOUNT

The MEXITCOUNT macro is defined at '/usr/include/machine/asmacros.h'. It expands to a call to .mexitcount. You should take another look at '/usr/src/sys/amd64/amd64/prof_machdep.c' to figure out what this call does. We are not commenting on it because we are not going in depth into profiling[9].

   492|     movq    TF_RDI(%rsp),%rdi   /* bonus; preserve arg 1 */
   493|     movq    TF_RSI(%rsp),%rsi   /* bonus: preserve arg 2 */
   494|     movq    TF_RDX(%rsp),%rdx   /* return value 2 */
   495|     movq    TF_RAX(%rsp),%rax   /* return value 1 */
   496|     movq    TF_RFLAGS(%rsp),%r11    /* original %rflags */
   497|     movq    TF_RIP(%rsp),%rcx   /* original %rip */
   498|     movq    TF_RSP(%rsp),%rsp   /* user stack pointer */

The lines above are self-explanatory: we are just restoring the registers saved in the trapframe before we can execute SYSRET.

   499|     cmpb    $0,pti
   500|     je  2f
   501|     movq    PCPU(UCR3),%r9
   502|     movq    %r9,%cr3
   503|     xorl    %r9d,%r9d

Here we have a test to see whether we reached this point through Xfast_syscall_pti or through Xfast_syscall. If we came through the *pti label, we need to restore our control register (%cr3).

   504| 2:  swapgs
   505|     sysretq

The swapgs instruction was explained earlier. We have finally reached the instruction which will send us back to user mode. We are going to finish the text the way we began, with a description extracted from the Intel Manual[1], but now for SYSRET:

   SYSRET is a companion instruction to the SYSCALL instruction. It returns
   from an OS system-call handler to user code at privilege level 3. It does
   so by loading RIP from RCX and loading RFLAGS from R11. With a 64-bit
   operand size, SYSRET remains in 64-bit mode; otherwise, it enters 
   compatibility mode and only the low 32 bits of the registers are loaded.
   SYSRET loads the CS and SS selectors with values derived from bits 63:48 of
   the IA32_STAR MSR. However, the CS and SS descriptor caches are not loaded 
   from the descriptors (in GDT or LDT) referenced by those selectors. Instead 
   the descriptor caches are loaded with fixed values. See the Operation 
   section for details. It is the responsibility of OS software to ensure that
   the descriptors (in GDT or LDT) referenced by those selector values 
   correspond to the fixed values loaded into the descriptor caches; the 
   SYSRET instruction does not ensure this correspondence.

That's all folks; if you are in doubt, don't ignore the references.



