
Guo Jian: An Analysis of Process Switching -- TLB Handling


I. Preface

Process switching is a complex process, and this article does not attempt to describe every aspect of it; it focuses on one small piece of knowledge within process switching: TLB handling. To explain it clearly, Chapter II describes the TLB-related details in a single-CPU scenario, and Chapter III advances to the multi-core scenario; with that, the theory part ends. In Chapters II and III we reason from a basic logical point of view, without restricting ourselves to any specific CPU or specific OS. Some understanding of basic TLB organization is assumed here; you can refer to this site's article "TLB operation" for the specifics. However good the logic is, it must ultimately be embodied in the design of HW blocks and SW blocks, so in Chapter IV we walk through the TLB handling code of the Linux 4.4.6 kernel on the ARM64 platform (some x86 code is also introduced when describing TLB lazy mode), in the hope that concrete code and actual CPU hardware behavior will reinforce the understanding of the principles.

II. How it works in the single-core scenario

1. Block diagram

Let us start with the single-core scenario. A block diagram of the logic involved in process switching is shown below:

Several user-space processes and kernel threads run on the CPU. To speed things up, the CPU design includes HW blocks such as the TLB and the Cache. The Cache exists to access the data and instructions in main memory faster, while the TLB caches part of the page table contents in the Translation Lookaside Buffer to perform address translation faster, avoiding the trip to the page tables in main memory.

If nothing special is done, then when switching from process A to process B, the TLB and Cache hold data of both the A and B processes. For kernel space this does not matter, since it is shared by all processes; but A and B each have their own independent user address space. That is to say, the same virtual address X is translated to Pa in A's address space but to Pb in B's address space. If, during address translation, the TLB holds entries of both A and B at once, the stale entries of A's address space would corrupt the translation of B's address space. Therefore, when a process switch happens, some TLB operation is needed to remove the influence of the old process. How exactly? Let us go through it step by step.

2. A scheme with no correctness problem at all, but poor performance

When a process switch occurs in the system, say from process A to process B, the address space switches from A's to B's. At this point we can assume that during A's execution all the data in the TLB and Cache belonged to process A; once we switch to B, the whole address space is different, so everything should be flushed (note: I am using Linux kernel terminology here. "flush" means setting the entries in the TLB or cache to invalid. An embedded engineer on an ARM platform is usually more accustomed to the term "invalidate"; either way, in this article flush equals invalidate).

This scheme is of course correct: when process B is switched in, the CPU it faces is a clean hardware environment, started from scratch, with no residue of process A left in the TLB or Cache to affect B's execution. The small regret is that when B starts executing, the TLB and Cache are both cold (empty), so at first B suffers heavy TLB misses and Cache misses, which degrades performance.

3. How can TLB performance be improved?

Optimizing a module usually requires analyzing and classifying its characteristics in finer detail. In the previous section we used the term "process address space"; it can in fact be subdivided into the kernel address space and the user address space. For all processes (including kernel threads), the kernel address space is the same, so for this part of address translation the mapping from kernel addresses to physical addresses never changes, no matter how processes are switched. In fact, when switching from process A to B, these entries (the orange blocks in the figure above) need not be flushed, because process B can keep using them. For the user address space, each process has its own; when switching from A to B, the TLB entries associated with process A (the cyan blocks in the figure above) are completely meaningless to B and must be flushed.

Guided by this idea, we actually need to distinguish two kinds of address translation, global and local (which really means process-specific). Therefore, page table descriptors often contain one bit that identifies whether an address is global or local, and likewise the TLB caches this global/local flag along with each entry. With this design, we can, depending on the scenario, either flush everything or flush only the local TLB entries.
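
To make this concrete, below is a minimal software model of such a TLB. It is purely an illustration of the logic: the struct layout and the function names (tlb_entry, flush_all_entries, flush_local_entries) are invented for this sketch and do not correspond to any real hardware or kernel API.

#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64

/* Hypothetical model of one TLB entry carrying the global/local flag. */
struct tlb_entry {
    uint64_t va;        /* virtual page number */
    uint64_t pa;        /* physical page number */
    bool     global;    /* kernel mapping, shared by every process */
    bool     valid;
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* The brute-force scheme of section 2: invalidate everything. */
static void flush_all_entries(void)
{
    for (int i = 0; i < TLB_ENTRIES; i++)
        tlb[i].valid = false;
}

/* The improved scheme: keep the global (kernel) entries and
 * invalidate only the process-specific ones. */
static void flush_local_entries(void)
{
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (!tlb[i].global)
            tlb[i].valid = false;
}

On a process switch, the improved scheme calls flush_local_entries instead of flush_all_entries, so the kernel-space translations stay hot across the switch.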

4. Special cases

Consider the following scenario: process A switches to kernel thread K. There is actually no need to switch the address space: what thread K can access are kernel-space addresses, which it shares with process A. Since the address space is not switched, there is no need to flush the process-specific TLB entries either; and when we later switch from K back to process A, all the TLB data is still valid, which greatly reduces TLB misses. In addition, in a multithreaded environment a switch may occur between two threads of one process; since the threads live in the same address space, no TLB flush is needed there either.

5. Further improving TLB performance

Can TLB performance be improved further? Is it possible not to flush the TLB at all?

It certainly is, but it requires the TLB block we design to be able to identify process-specific TLB entries; in other words, the TLB block must be aware of each process's address space. To accomplish this design, we need a way to identify different address spaces, known by the term ASID (Address Space ID). Originally, a TLB lookup was decided purely by the virtual address VA; with ASID support, the criterion for a TLB hit is amended to (virtual address + ASID), where each process is assigned an ASID that identifies its own process address space. How does the TLB block know the ASID of a TLB entry? It usually comes from a CPU system register (on the ARM64 platform, from the TTBRx_EL1 register). Thus, while caching (VA, PA, Global flag), the TLB block also caches the current ASID into the corresponding TLB entry, so that a TLB entry contains (VA, PA, Global flag, ASID).

With ASID support, switching from process A to process B no longer requires a TLB flush, because the residual entries of A's address space cached in the TLB during A's execution cannot affect process B: although A and B may use the same VA, the ASID guarantees that the hardware can tell A's and B's address spaces apart.
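
Continuing the toy model from section 3 (again, the types and names here are invented for illustration, not a real API), the ASID turns the hit check from a pure VA match into a (VA, ASID) match, with global entries exempt:

#include <stdbool.h>
#include <stdint.h>

struct tlb_entry {
    uint64_t va;
    uint64_t pa;
    uint16_t asid;      /* owning address space; ignored if global */
    bool     global;
    bool     valid;
};

/* Hit iff the VA matches and the entry is either global (kernel
 * mapping) or tagged with the current address space. On real
 * hardware, current_asid comes from a system register, e.g.
 * TTBRx_EL1 on ARM64. */
static bool tlb_hit(const struct tlb_entry *e,
                    uint64_t va, uint16_t current_asid)
{
    return e->valid && e->va == va &&
           (e->global || e->asid == current_asid);
}

Because a stale entry of process A can no longer match a lookup tagged with B's ASID, the process switch itself needs no flush at all.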

III. TLB operations in the multi-core scenario

1. Block diagram

Having finished the analysis of the single-core scenario, let us look at multi-core. A block diagram of the TLB logic involved in process switching is shown below:

In a multi-core system, TLB handling at process switch is a little more complicated, for two main reasons. First, each CPU core has its own TLB, so TLB operations fall into two categories: one is flush all, which flushes the TLBs on all CPU cores; the other is flush local tlb, which flushes only this CPU core's TLB. Second, a process can be scheduled to run on any CPU core (subject, of course, to the cpu affinity settings), which scatters the task's footprint everywhere (residual TLB entries on every CPU it has run on).

2. The basic approach to TLB operations

From the description in the previous section, we already know that address translation comes in global (shared by all processes) and local (process-specific) flavors, and thus TLB entries are divided the same way. If we do not distinguish these two concepts, then on a process switch we simply flush all the residue on this CPU. That way, when process A is switched out, it leaves the next process B a fresh TLB; and when process A is later scheduled onto another CPU, it likewise faces a completely empty TLB (the TLBs of the other CPUs cannot help it). Of course, if we do distinguish global from local, the operation is basically similar; the only difference is that on a process switch we do not flush all the TLB entries on this CPU, but only all the local TLB entries.

Local TLB entries can be subdivided further, which is where the concept of ASID (Address Space ID) or PCID (Process Context ID) comes in (global TLB entries do not distinguish ASIDs). With ASID (or PCID) support, TLB operations become simple: we may not need to perform any TLB operation at all on a process switch, because the TLB lookup can distinguish task contexts, so the residue each CPU's TLB keeps cannot affect the execution of other tasks. On a single-core system this performs very well. For example, in an A--->B--->A scenario, if the TLB is large enough to hold the entries of both tasks (modern CPUs generally can), then when A is switched back in, its TLB is hot, which greatly improves performance.

For a multi-core system, however, this situation brings a little trouble; it is in fact the notorious TLB shootdown performance problem. In a multi-core system, if the CPUs support PCID and the TLB is not flushed on process switches, then each CPU's TLB accumulates entries of all kinds of tasks. When, on some CPU, a process is destroyed, or modifies its own page tables (that is, changes its VA-to-PA mappings), the TLB entries related to that task must be purged from the whole system. At that point, it is not enough to flush the corresponding entries in this CPU's TLB; the residual entries of that task on the other CPUs must be shot down as well. This action is usually carried out via IPI (on x86, for example), which introduces overhead. In addition, PCID allocation and management bring extra cost of their own; therefore, whether an OS supports PCID (or ASID) is left for the arch code to decide (in Linux, x86 does not use it, whereas the ARM platform does).
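
The shape of an IPI-based shootdown can be sketched as follows. This is a simplification modeled on what x86 Linux 4.4 does in flush_tlb_mm()/native_flush_tlb_others() (arch/x86/mm/tlb.c); the range-flush path, preemption handling and the lazy-mode filtering discussed later are all omitted.

static void flush_tlb_ipi(void *info)
{
    struct mm_struct *mm = info;

    /* Each interrupted CPU flushes only if it is currently using
     * the address space being invalidated. */
    if (this_cpu_read(cpu_tlbstate.active_mm) == mm)
        local_flush_tlb();
}

static void shootdown(struct mm_struct *mm)
{
    /* Flush our own TLB first... */
    local_flush_tlb();

    /* ...then interrupt every other CPU that has run this mm
     * (tracked in mm_cpumask) and wait for them to finish. */
    smp_call_function_many(mm_cpumask(mm), flush_tlb_ipi, mm, true);
}

The cost is clear: one interrupt per remote CPU, plus a synchronous wait, every time a mapping changes.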

IV. Code analysis of the TLB operations during process switching

1. TLB lazy mode

In context_switch there is a piece of code like this:

if (!mm) {                              /* next is a kernel thread: no mm */
    next->active_mm = oldmm;            /* borrow the previous task's mm */
    atomic_inc(&oldmm->mm_count);
    enter_lazy_tlb(oldmm, next);        /* mark this cpu as lazy-TLB */
} else
    switch_mm(oldmm, mm, next);         /* real address space switch */

This code means: if the task to be switched in is a kernel thread (next->mm == NULL), then the enter_lazy_tlb function can mark the next task on this CPU as having entered lazy TLB mode. Because enter_lazy_tlb is an empty function on the ARM64 platform, we use x86 to describe lazy TLB mode.

Of course, we need some preparation first; after all, to embedded engineers familiar with the ARM platform, x86 is a little foreign.

So far, we have described TLB operations purely from a logical point of view. In practice, though, is the TLB operation during a process switch done by HW or by SW? Different processors take different views (the exact reasons are unknown to me). Some processors do it in HW; on x86, for example, when loading the CR3 register to switch the address space, the hardware operates on the TLB automatically. Other processors need software to do the TLB work; on the ARM family of processors, for example, switching the TTBR register triggers no HW tlb action, so SW must perform the TLB operations. Therefore, on the x86 platform, software does not need to call a tlb flush function explicitly on a process switch: the switch_mm function loads the CR3 register with the next task's mm->pgd, and this load of CR3 causes all the local TLB entries on this CPU to be flushed.
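
Stripped of the LDT handling, the finer cpumask bookkeeping and the trace points, the x86 switch_mm of Linux 4.4 (arch/x86/include/asm/mmu_context.h) boils down to roughly this:

static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
                             struct task_struct *tsk)
{
    unsigned cpu = smp_processor_id();

    if (likely(prev != next)) {
        this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
        this_cpu_write(cpu_tlbstate.active_mm, next);
        cpumask_set_cpu(cpu, mm_cpumask(next));

        /* Loading CR3 installs the new page table root, and the
         * hardware flushes all non-global local TLB entries as a
         * side effect: no explicit flush call appears anywhere. */
        load_cr3(next->pgd);

        /* Stop receiving shootdown IPIs for the old mm. */
        cpumask_clear_cpu(cpu, mm_cpumask(prev));
    }
}

Note also the prev != next test: when the two tasks share one address space (case (2) in the lazy-mode discussion below), the whole body, including the CR3 load, is skipped.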

What would happen if the x86 supports PCID (the x86 term, equivalent to ARM's ASID)? Would loading CR3 still flush all the local TLB entries on the local CPU? Actually, in Linux, because of TLB shootdown, ordinary Linux does not use PCID (KVM does, but that is beyond the scope of this article); so for x86, switching the process address space simply carries the side effect of flushing the local TLB entries.

One more place where ARM64 and x86 differ: ARM64 supports executing, on one CPU core, a TLB flush instruction such as tlbi vmalle1is, which flushes the TLBs of all the CPU cores in the inner shareable domain. x86 cannot do this; to flush the TLBs of multiple CPU cores in the system, it can only notify the other CPUs via IPI.
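
This is visible directly in the arm64 code: flush_tlb_all in arch/arm64/include/asm/tlbflush.h (Linux 4.4) is just a handful of instructions, with no IPI anywhere (the comments are added here):

static inline void flush_tlb_all(void)
{
    dsb(ishst);             /* make prior page-table writes visible */
    asm("tlbi vmalle1is");  /* invalidate all EL1 entries, broadcast
                               to the inner shareable domain */
    dsb(ish);               /* wait for the broadcast to complete */
    isb();                  /* resynchronize the instruction stream */
}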

Good; with that, all the preparatory knowledge is in place, and we come to the topic of TLB lazy mode. Although process switching is normally accompanied by TLB flush operations, some scenarios can avoid it. In the following scenarios (still described with A--->B task switching), we need not flush the TLB:

(1) If the task B to be switched in is a kernel thread, we do not need to flush the TLB, because kernel threads do not access userspace, and the residual TLB entries of process A do not affect the kernel thread's execution; after all, B has no user address space of its own, and it shares the kernel address space with A.

(2) If A and B are in the same address space (two threads of one process), we do not need to flush the TLB.

Besides process switching, there are other TLB flush scenarios. Let us first look at a generic TLB flush scenario, as shown in the figure below:

In a 4-core system, tasks A0, A1 and A2 belong to the same process address space. CPU_0 and CPU_2 are running A0 and A2 respectively. CPU_1 is a bit special: it is running a kernel thread, but that kernel thread is borrowing the address space of task A1. CPU_3 runs an unrelated task B.

When A0 modifies its own address translations, it cannot just flush CPU_0's TLB; it must also notify CPU_1 and CPU_2, because the address space currently active on those two CPUs is the same as on CPU_0. Because of A0's modification, the TLB entries cached on CPU_1 and CPU_2 are no longer valid and need flushing. The same reasoning extends to more CPUs; that is to say, when a task running on some CPUx modifies address mappings, the tlb flush must be delivered to all the related CPUs (those whose current mm equals CPUx's current mm). In a multi-core system, such IPI-delivered TLB flush messages multiply as the number of CPU cores grows. Is there any way to reduce the unnecessary TLB flushes? Of course there is: it is precisely the A1 task scenario in the figure above, the legendary lazy tlb mode.

Let us look back at the code first. In the code, if the next task is a kernel thread, we do not perform switch_mm (the function whose actions cause a tlb flush), but call enter_lazy_tlb to enter lazy tlb mode. Under the x86 architecture, the code is as follows:

static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
{
#ifdef CONFIG_SMP
    /* no hardware is touched: just record that this cpu is now lazy */
    if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
        this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
#endif
}

Under the x86 architecture, entering lazy tlb mode simply means setting the TLBSTATE_LAZY state in this CPU's cpu_tlbstate variable. So, when entering lazy mode, we neither call switch_mm to switch the process address space nor execute the now-meaningless tlb flush. enter_lazy_tlb does not operate on the hardware at all; it merely records the software state of this CPU.

After the switch, the kernel thread starts executing. CPU_1's TLB retains the residual entries of process A, which does not affect the kernel thread's execution; but what happens when another CPU sends an IPI demanding a TLB flush? In principle the flush should happen immediately, but in lazy tlb mode we may skip the flush tlb operation. So here comes the question: when do the residual TLB entries of process A get flushed? The answer is: at the next process switch. Once the kernel thread is scheduled out and a new process C is switched in, switch_mm, while switching to C's address space, wipes out all the earlier residue (because CR3 is loaded there). Therefore, while a kernel thread is executing, we can postpone the tlb invalidate requests; that is to say, when an IPI interrupt arrives asking us to invalidate the tlb entries of some mm, we need not carry it out: we just record the state.
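
In code, the deferral looks roughly like this. The sketch below condenses flush_tlb_func() and leave_mm() from arch/x86/mm/tlb.c (Linux 4.4), dropping the range-flush path and the statistics:

static void flush_tlb_func(void *info)
{
    struct mm_struct *mm = info;

    if (mm != this_cpu_read(cpu_tlbstate.active_mm))
        return;

    if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK) {
        /* Actively using this mm: obey the IPI and flush now. */
        local_flush_tlb();
    } else {
        /* TLBSTATE_LAZY: a kernel thread is merely borrowing this
         * mm. Instead of flushing, leave_mm() drops this CPU from
         * mm_cpumask(), so no further shootdown IPIs arrive for an
         * address space we are not really using. */
        leave_mm(smp_processor_id());
    }
}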

2. How does ARM64 manage the ASID?

Unlike x86, ARM64 supports ASID (similar to x86's PCID); doesn't ARM64 then suffer from the TLB shootdown problem? I have actually been pondering this myself and have not fully figured it out. Evidently, on ARM64 we do not need IPIs to perform the TLB flush on every CPU core: ARM64 supports, at the instruction-set level, TLB flush operations acting on all the PEs in a shareable domain. Perhaps it is this instruction that makes TLB flushes cheap enough that supporting ASID becomes the natural choice: no TLB operation is needed during a process switch, and at the same time, since no IPI is needed to propagate TLB flushes, there is no need for the special handling of lazy tlb mode either.

Since ARM64 Linux chooses to support ASID, it must face ASID allocation and management. The ASIDs supported by hardware are limited: the space is 8 or 16 bits, for at most 256 or 65536 IDs. What happens when the ASIDs overflow? This requires some software control to coordinate. Using a hardware maximum of 256 ASIDs to describe the basic idea: while the asids in the TLBs of all the CPUs in the system total no more than 256, the system operates normally; once the ceiling of 256 is exceeded, we flush all the TLBs and reallocate ASIDs. Every time the ceiling of 256 is reached, the TLBs must be flushed and the HW ASIDs reallocated. The concrete ASID allocation code is as follows:

static u64 new_context(struct mm_struct *mm, unsigned int cpu)
{
    static u32 cur_idx = 1;
    u64 asid = atomic64_read(&mm->context.id);
    u64 generation = atomic64_read(&asid_generation);

    if (asid != 0) {                                              /* --(1) */
        u64 newasid = generation | (asid & ~ASID_MASK);

        if (check_update_reserved_asid(asid, newasid))
            return newasid;

        asid &= ~ASID_MASK;
        if (!__test_and_set_bit(asid, asid_map))
            return newasid;
    }

    asid = find_next_zero_bit(asid_map, NUM_USER_ASIDS, cur_idx); /* --(2) */
    if (asid != NUM_USER_ASIDS)
        goto set_asid;

    generation = atomic64_add_return_relaxed(ASID_FIRST_VERSION,  /* --(3) */
                         &asid_generation);
    flush_context(cpu);

    asid = find_next_zero_bit(asid_map, NUM_USER_ASIDS, 1);       /* --(4) */

set_asid:
    __set_bit(asid, asid_map);
    cur_idx = asid;
    return asid | generation;
}

(1) When a new process is created, in its new mm the software asid (mm->context.id) is initialized to 0. If asid is not equal to 0, this mm has been allocated a software asid (generation + hw asid) before, and new_context merely updates the old generation in the software asid to the current generation.

(2) If asid equals 0, we really need to allocate a new HW asid. The first step is to look for a free HW asid; if one can be found (jump to set_asid), we return the new software asid directly (current generation + the freshly allocated hw asid).

(3) If no free HW asid can be found, the HW asids are exhausted, and the only option is to bump the generation. At this point, the old-generation asids on all the CPUs must be flushed, because the system is about to enter the new generation. Incidentally, by this point the generation variable has already been assigned the new generation value.

(4) In the flush_context function, the asid_map that tracks HW asid allocation is cleared to all zeros, so what happens here is the allocation of a HW asid in the new generation.
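
For reference, the bit layout that new_context juggles is set up at the top of arch/arm64/mm/context.c (Linux 4.4) like this:

static u32 asid_bits;            /* 8 or 16, probed from ID_AA64MMFR0_EL1 */
static atomic64_t asid_generation;
static unsigned long *asid_map;  /* bitmap of HW asids in the current generation */

#define ASID_MASK           (~GENMASK(asid_bits - 1, 0))
#define ASID_FIRST_VERSION  (1UL << asid_bits)
#define NUM_USER_ASIDS      ASID_FIRST_VERSION

So, with 16-bit hardware ASIDs, a software asid is generation | hw_asid: the low 16 bits travel to the hardware, and each rollover adds ASID_FIRST_VERSION (1 << 16) to the generation, leaving the low bits free for a fresh round of allocation.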

3. ARM64 TLB operations and ASID handling during process switching

The code is check_and_switch_context in arch/arm64/mm/context.c:

void check_and_switch_context(struct mm_struct *mm, unsigned int cpu)
{
    unsigned long flags;
    u64 asid;

    asid = atomic64_read(&mm->context.id);                        /* --(1) */

    if (!((asid ^ atomic64_read(&asid_generation)) >> asid_bits)  /* --(2) */
        && atomic64_xchg_relaxed(&per_cpu(active_asids, cpu), asid))
        goto switch_mm_fastpath;

    raw_spin_lock_irqsave(&cpu_asid_lock, flags);
    asid = atomic64_read(&mm->context.id);
    if ((asid ^ atomic64_read(&asid_generation)) >> asid_bits) {  /* --(3) */
        asid = new_context(mm, cpu);
        atomic64_set(&mm->context.id, asid);
    }

    if (cpumask_test_and_clear_cpu(cpu, &tlb_flush_pending))      /* --(4) */
        local_flush_tlb_all();

    atomic64_set(&per_cpu(active_asids, cpu), asid);
    raw_spin_unlock_irqrestore(&cpu_asid_lock, flags);

switch_mm_fastpath:
    cpu_switch_mm(mm->pgd, mm);
}

Looking at this code, you may well go mad: wasn't a process switch supposed to require no TLB flush operation at all once ASID is supported? Why so much code? Ha ha... the ideal is beautiful, the reality is bony: a great deal of asid management logic is embedded in this code.

(1) The mm variable points to the address space about to be switched in; the first step is to obtain that address space's ID (the software asid) from the memory descriptor. Note that this ID is not the HW asid: mm->context.id is in fact 64 bits, of which the low 16 bits correspond to the HW ASID (ARM64 supports 8-bit or 16-bit ASIDs, but here we assume the current system's ASID is 16-bit). The remaining bits are all a software extension, which we call the generation.

(2) arm64 supports the ASID concept, so in theory a process switch needs no TLB operation. However, since the HW asid space is limited, we extend it into a 64-bit software asid, part of which corresponds to the HW asid and the rest of which is called the asid generation. The asid generation starts from ASID_FIRST_VERSION, and every time the HW asids overflow, the asid generation is incremented. asid_bits is the number of ASID bits supported by the hardware, 8 or 16; the concrete bit count can be obtained from the ID_AA64MMFR0_EL1 register.

If the software asid of the mm being switched in is still within the current batch (generation) of ASIDs, no TLB operation is needed, and cpu_switch_mm can be called directly to switch the address space; of course, the active_asids percpu variable is also set along the way.

(3) If the process being switched in does not match the current asid generation, its address space needs a new software asid; more precisely, it needs to be advanced into the new generation. So here we call new_context to allocate a new context ID and set it into mm->context.id.

(4) Each CPU, when stepping into the new generation of the asid space, calls local_flush_tlb_all to flush its local TLB.
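
cpu_switch_mm eventually reaches cpu_do_switch_mm in arch/arm64/mm/proc.S, a few lines of assembly that install the page table root and the ASID in a single TTBR0_EL1 write. Rendered as a C sketch (the function name with the _sketch suffix and the explicit bit-twiddling are mine; the real code is assembly):

/* What cpu_do_switch_mm does, approximately, expressed in C. */
static inline void cpu_do_switch_mm_sketch(unsigned long pgd_phys,
                                           struct mm_struct *mm)
{
    /* the low 16 bits of the software asid are the HW ASID */
    unsigned long asid = atomic64_read(&mm->context.id) & 0xffff;
    unsigned long ttbr = pgd_phys | (asid << 48);  /* ASID in TTBR0_EL1[63:48] */

    asm volatile("msr ttbr0_el1, %0\n\tisb" : : "r" (ttbr));
}

One register write thus switches both the translation tables and the address-space tag; no TLB maintenance instruction appears on this fast path.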

References:

1. 64-ia-32-architectures-software-developer-manual-325462.pdf (Intel 64 and IA-32 Architectures Software Developer's Manual)
2. DDI0487A_e_armv8_arm.pdf (ARMv8 Architecture Reference Manual)
3. Linux 4.4.6 kernel source code

Source: http://www.wowotech.net/process_management/context-switch-tlb.html
