Monday, June 21, 2010

User space IO - Some Challenges & Mitigations

There is pretty good information about UIO on the Internet.  This link provides a good introduction to the subject.

What is UIO (User space I/O) framework?

The UIO framework is a part of the Linux kernel that enables device driver development in user space.


Which applications require User space drivers?

Zero-copy drivers are becoming necessary for performance reasons.  Many network packet processing applications were traditionally developed in Linux kernel space.  Firewall, NAT and IPsec are some examples that you find in kernel space.  Increasingly, these applications are being moved to user space for multiple reasons - availability of large memory space, ease of debugging, faster image upgrade and restart, and many more.  Moving these applications without moving the Ethernet driver or acceleration drivers reduces performance.  Even though there are some efficient mechanisms to transfer packets between kernel and user space, they still take some core cycles.  Having access to the hardware from user space eliminates the need for any mechanism to transfer packets back and forth between user space and kernel space.  The UIO framework allows user space applications to own the device.  It does this by letting the application kernel driver map the hardware IO to the user space process.  The UIO framework also allows the application kernel driver to register an interrupt handler with the hardware IRQ and wake up the user space daemon upon a hardware interrupt.  The user space application, upon getting an indication from the UIO framework, reads the packets or acceleration results from the hardware memory directly without involving the kernel.


What are the components involved in UIO?

The UIO framework is part of the kernel itself.  Application developers need to develop one simple kernel module and a user space application.  The kernel module, as indicated above, is expected to register an interrupt handler with the UIO framework and also indicate the memory ranges (address, size pairs) to the UIO framework.  The user space application opens the appropriate UIO device /dev/uioX (X being the minor number), gets hold of the memory map ranges from the 'sysfs' file system, does the memory map and waits for interrupt events using either the read() or poll()/epoll() system calls.  When read()/epoll() returns, it can read the content from the memory mapped area and do the actual application processing on the packets.

API exposed by UIO framework for application kernel modules:

uio_register_device(struct device *parent,  struct uio_info *info)

This function is expected to be called by the application kernel module, with 'info' filled up with the right values.  At the end of this function, the UIO framework creates the device /dev/uioX where X is a dynamically assigned minor number.  This is the device which is expected to be opened by the user space program to read interrupt events.

struct uio_info {
    const char        *name;
    const char        *version;
    struct uio_mem        mem[MAX_UIO_MAPS];
    struct uio_port        port[MAX_UIO_PORT_REGIONS];
    long            irq;
    unsigned long        irq_flags;
    struct uio_device *uio_dev;
    irqreturn_t (*handler)(int irq, struct uio_info *dev_info);
    int (*mmap)(struct uio_info *info, struct vm_area_struct *vma);
    int (*open)(struct uio_info *info, struct inode *inode);
    int (*release)(struct uio_info *info, struct inode *inode);
    int (*irqcontrol)(struct uio_info *info, s32 irq_on);
};

name, version:  The application driver can provide any string here.  Since the 'name' field is used by the user space application to figure out the device name (/dev/uioX), it is necessary that the name field is chosen such that it is specific to your application and unique across UIO devices.  Note that the value X in /dev/uioX is chosen dynamically by the UIO framework.  The X value can be different across restarts of the Linux system.  So, if the user space application hardcodes the device file in its code, there could be an issue when the system restarts and UIO devices register with the UIO framework in a different order.  'name' is the one that is constant across restarts as it is given by the application driver.  Since the device name is not constant, the user space application, upon its initialization, should find out the UIO device name based on the value of 'name'.

The UIO framework creates a set of files under the /sys/class/uio/ directory.  Under /sys/class/uio/, there are as many subdirectories as there are UIO devices.  If there are two UIO devices, then there would be two subdirectories - /sys/class/uio/uio0/ and /sys/class/uio/uio1/.  Under each uioX directory, there is a set of files - 'name', 'version', 'event' - and a set of directories - 'maps' and 'device'.  The 'name' file contains the name of the device given by the application driver in the first line.  The 'version' file contains the version string given by the application driver.  The 'event' file contains the number of times the interrupt service routine has been called so far.

User space application software is expected to find the right device by scanning the directory entries (using scandir()) in the /sys/class/uio/ directory.  For each directory entry, it needs to open the file 'name', read the first line and check the name.  If it matches the name the application is looking for, note down the directory entry - uioX.  Use this to form the /dev/uioX string to open the UIO device.  The FD returned by opening the device can be used to read interrupt events.  This FD can also be registered with epoll(), which is useful if your application needs to wait for events from multiple file descriptors.

struct uio_mem mem[MAX_UIO_MAPS]:  If your application requires mapping the register space of the hardware into your user space application, then the application kernel driver is expected to fill this up.  Since multiple memory ranges could be required to access the hardware, UIO provides the facility to give multiple memory ranges via this array of mappings.

struct uio_mem {
    const char *name;
    unsigned long addr;
    unsigned long size;
    int memtype;
    void  __iomem *internal_addr;
   ...
}

It is expected that the application kernel driver fills up the array of memory ranges using the above structure at registration time.  The user space application is expected to read the memory ranges from the /sys/class/uio/uioX/maps/ directory and do the memory mapping using mmap().  If the application kernel driver fills up four memory ranges, then there would be four subdirectories under /sys/class/uio/uioX/maps/ - map0, map1, map2 and map3.  Under each 'mapX' subdirectory, there are three files - name, addr and size.  The 'addr' file contains the address and the 'size' file contains the size.  See below for an explanation.  The user space application is expected to read all pairs of 'addr' and 'size' and use the mmap() function to map them into its virtual space.  Some explanation of the fields of uio_mem before going into further details:
  • addr:  It could be physical,  logical or virtual memory. Mostly it would be physical address as hardware device memory is exposed here.
  • size:  Size of the memory that needs to be exposed to the user space.
  • name :  Name given to each memory range.  
  • internal_addr:  This is not meant for user space programs.  The kernel driver can initialize this for its own later use by the interrupt service routine or the irqcontrol function.  Typically, this memory is mapped using ioremap().
One thing to note is that memory mapping is always with respect to a page boundary.  Very often, the device memory does not start at a page boundary.  Hence it is required that the user space application adds the right offset to the return address of mmap() to point to the right locations in the device.  The user space application is expected to keep this 'offset' for each memory range, keyed by 'name'.

The mmap() function takes one parameter, 'offset' (note that this offset has nothing to do with the offset explained above).  This offset is normally given in multiples of the page size.  This offset field is used by the UIO framework to determine the memory range that the user space program intends to map.  Note that the Linux IO infrastructure allows the UIO framework to have only one corresponding mmap() function.  Whenever mmap() is called in user space, the mmap() function of the UIO framework in the kernel is called.  The UIO mmap() function internally calls the remap_pfn_range() function to map the memory.  Since there is only one mmap() function, how does UIO know which memory range to map?  To solve this, UIO expects the user space program to pass an offset of N * getpagesize(), N being the memory map index.  UIO internally extracts the memory map index from the offset field and uses the corresponding 'addr' and 'size' values.

irq:  If your hardware device needs to interrupt the user process, then the application kernel driver is expected to register the IRQ number with the UIO framework.  If the hardware device does not have this facility or an interrupt is not required, then '0' needs to be passed.  The UIO framework also provides an API function, uio_event_notify(), to wake up the user process.  This can be used by a timer or other facilities to wake up the user space process if the hardware device does not support interrupts.

irq_flags:  The kernel driver is expected to pass these flags.  These flags are given to the request_irq() function by the UIO framework.  Typically, the IRQF_SHARED flag is set if the IRQ is shared across more than one hardware device.

uio_dev:  This is filled up by the UIO framework, which puts its own private information in there.  For every registration, the UIO framework creates an instance of uio_device and keeps it there.  It is not expected to be interpreted by the application kernel driver.  Any further calls to the UIO framework from the application kernel driver are expected to pass uio_info.  The UIO framework gets its instance from uio_info->uio_dev and uses the information there to do its processing.

irqreturn_t (*handler)(int irq, struct uio_info *dev_info):  This is the main interrupt handler, expected to be provided by the application kernel driver.  The application kernel driver implements the interrupt service routine as required by the device.  Waking up the user process is taken care of by the UIO framework itself.  The UIO framework sets its own function as the interrupt handler while calling request_irq().  That is, when there is an interrupt, the UIO framework gets control first.  It calls the application driver handler function and then does whatever is necessary to wake up the user process.  Hence the application driver handler does not need to worry about waking up the user process.  My observation is that, more often than not, the application driver interrupt handler does not do much.  Mostly, it just disables further interrupt generation by programming the device registers.  What should be done in the application driver handler depends on the hardware device capabilities.
  • Hardware devices typically have the capability for software to mask/unmask interrupt generation.  They also provide the ability for software to acknowledge previous events so that the hardware generates interrupts only for new events - for example, when new packets come in while interrupts are enabled.  If the hardware has these capabilities, then the kernel handler typically disables interrupt generation.  The user space process, upon being woken up, indicates to the hardware to generate interrupts only for new events from now onward, reads all device events in a loop (packets, results etc.) and then enables the interrupt.  This method automatically provides a coalescing capability.  The user space process is woken up upon the first event and interrupts are disabled by the kernel handler.  By the time the user space process wakes up, it processes not only the event that woke it up, but also any other events that have come in after that.
  • Note that the user space process or thread may be processing packets from multiple UIO devices.  In this case, if the user process processes all the packets coming from one device in a loop until all events are read, then there is a chance that packets from other devices are not handled in a timely fashion.  It is expected that all devices are given a fair chance.  One way to take care of this is to have one thread per device, but that may not be efficient.  It appears that performance is best when the number of packet processing threads equals the number of cores/HW threads.  There could be more devices than threads, so for efficiency reasons one thread may need to work with multiple devices.  In these cases, to provide fairness across devices, it is necessary that the thread handles only a 'quota' number of packets from each device before revisiting the devices again.  This concept is similar to the NAPI model adopted in Linux Ethernet drivers.
int (*irqcontrol)(struct uio_info *info, s32 irq_on):  This function pointer is filled up by the application kernel driver to allow the user space process to explicitly enable/disable interrupt generation by the hardware device.  This function gets called by the UIO infrastructure when the user space process calls write() on the UIO device fd.  Normally, the irq handler disables interrupt generation and the user space process enables interrupts using the mapped memory.  Some hardware devices might have race conditions if two contexts update interrupt mask related registers; this can happen when the mask register is used for other purposes.  In these cases, central control of enable and disable is necessary.  Modern hardware devices don't have this issue and hence this function registration is usually not required.

int (*mmap)(struct uio_info *info, struct vm_area_struct *vma):  In usual cases, the application kernel driver need not set this pointer.  The UIO infrastructure has its own mmap() function defined which can do the memory mapping when user space calls mmap(), using the uio_mem mapping array.  At times, the number of entries needed could be more than MAX_UIO_MAPS, in which case the UIO infrastructure will not be able to do the mapping.  In this case, the application kernel driver will need to provide the mmap() function pointer and do the necessary mapping itself.

Even though the UIO framework allows the application kernel driver to indicate the memory ranges to map, or to register an application specific mmap() function pointer, more often I see that neither of them is used.  The UIO framework is predominantly used only for registering the interrupt handler to wake up the application user process.  Many times, the application kernel driver itself is made a character device driver with its own ioctl() and mmap() functions in addition to open() and close().  There are multiple reasons for doing this.  One of the reasons is given below.
  • Applications not only need to map the device specific memory locations, but also map kernel memory for packet/acceleration-result buffers.  The UIO infrastructure does not provide this.  Ethernet hardware devices typically expose rings of descriptors to receive packets.  The application is expected to provide a buffer in each descriptor, and the Ethernet controller fills up the buffers in the descriptors with incoming packets.  Buffers that are given to the Ethernet controller must be physical addresses - the current generation of multicore SoCs don't have the capability to convert from virtual space to physical space internally.  Hence physical addresses need to be provided for buffers that go in receive descriptors.  Since Linux user space does not have physical memory of its own, it needs to get this memory from kernel space.  The application kernel driver does this job.  The user space program asks the kernel driver to allocate and map the memory to user space.  When mmap() in user space returns, it has the virtual address.  It gets the physical address of the allocated buffer from the kernel driver; it uses the physical address while programming the hardware and the virtual address while using it in its program.  User space programs typically ask the kernel driver to allocate a big amount of memory and then ask for that memory to be mapped.  Packet buffer allocation/free is done from this big chunk.

Applications may require big chunks of memory blocks for several reasons - packet buffers, acceleration results and even local contexts.  But there is only one mmap() function and there are no special arguments by which the user process indicates the purpose to the application driver.  Hence, it is necessary that there is some kind of protocol between the user space process and the kernel driver.  One method that is typically followed is to indicate the purpose via one IOCTL command, then do mmap(), and then issue another IOCTL command to learn the base address of the allocated memory.  Let us say that there are two different memory chunks to be allocated - Chunk1 of size 128 Kbytes for packets and Chunk2 of size 64 Kbytes for acceleration results.  Then the sequence by which user space calls the kernel driver through the FD is:

ioctl(fd,  SET_PURPOSE,  argument consisting of  type 'CHUNK1',  size '128K')
mmap()
ioctl(fd,  GET_MMAP_RESULT, argument consisting of 'physical address').

A similar sequence needs to be followed whenever Chunk2 is required.

The kernel driver needs to keep the information given via SET_PURPOSE in its private information.  When the kernel driver's mmap() function gets invoked, it allocates memory using kmalloc() and calls remap_pfn_range().  It stores the address returned by kmalloc() in its private information.  This is given back to user space when the GET_MMAP_RESULT command is issued.  All three operations need to happen atomically.  The kernel driver may want to enforce the sequence and return an error if a new sequence is started before the old one is completed.

int (*open)(struct uio_info *info, struct inode *inode), int (*release)(struct uio_info *info, struct inode *inode):  These function pointers can be set by the application kernel driver to get control whenever user space applications open or close the UIO device.  The driver can do any setup or cleanup necessary there.


Saturday, June 19, 2010

IPv6 and eNodeB in LTE world - Technical bit

Does IPv6 migration of UEs and the PDN Gateway require any support from eNodeBs?  eNodeB, MME, HLR, AAA and SGW need not be upgraded immediately along with the PDN gateway to support IPv6 external connectivity for the UEs.

eNodeBs communicate with the other 3GPP network elements for signaling and for transporting UE traffic over GTP tunnels.  They relay traffic from the PDCP layer to the GTP tunnels and vice versa.  As such, they don't look deep into the data; it could be an IPv4 or IPv6 packet.  The GTP layer in the eNodeB is expected to pick up the data, prepend the GTP-U/UDP/IP header and send it over to the SGW, secured with IPsec.  GTP-U can use IPv4 connectivity to talk to the SGW and other eNBs, and similarly the signaling communication can happen over an IPv4 network.

PDCP RoHC (Robust Header Compression) is the only mandatory component in the eNodeB which interprets the packets to/from UEs.  I believe the first generation of eNBs would be expected to support both IPv4 and IPv6 in PDCP RoHC.  Some eNBs have sophisticated functionality such as 'Application Detection'; any component that interprets the user traffic would need to have IPv6 intelligence.  I believe eventually all the components such as GTP-U, IPsec and QoS will require IPv6 support as operators move to IPv6-only networks.

By the way, I saw one IETF draft on IPv6 in 3GPP EPS.  Please find it here.  

Sunday, June 13, 2010

Avoiding double IP reassembly in eNodeB and IPsec Gateway in LTE - Red Side Fragmentation

I have given one use case long time back for Red side fragmentation feature. Please find it here.  There is one more use case where one can avoid double IP reassembly.

In the LTE world, eNodeB and SGW (Serving Gateway) communicate using IPsec over the backhaul network.  IPsec functionality is normally part of the eNodeB.  But on the SGW side, IPsec is normally deployed in a separate network element (IPsec Gateway).  Due to this, there could be double IP reassembly in the eNodeB.  Avoiding one reassembly can improve eNodeB performance.  This can be achieved by configuring 'Red side fragmentation' on the IPsec GW in the core network near the SGW.

Let us step back and look at the packet processing.  Any packet between eNodeB and SGW is first tunneled using a GTP tunnel.  Then it gets tunneled using IPsec to secure the traffic.  Let us assume that the MTU of all the links is 1500.  This is not a bad assumption at all, as the interfaces of these network elements in many deployments are Ethernet interfaces.  Let us also assume that a 1500 byte packet is being sent from the SGW to the eNodeB (downlink packet).  GTP in the SGW adds the GTP/UDP/IP header to the packet.  Since the MTU of the transmitting interface is 1500, this packet gets broken into two - one packet of 1500 bytes and another with the rest of the data.  Now these packets go to the IPsec gateway.  IPsec adds an ESP/IP header to the first 1500 byte fragment and to the second fragment.  The first fragment, after IPsec is done, exceeds the MTU and gets fragmented again.  So there are now a total of three packets for the original 1500 byte packet.  The eNodeB, upon receiving the packets, first needs to reassemble the fragments for IPsec inbound processing.  For GTP-U consumption, it then needs to reassemble the fragments which went through IPsec inbound processing.

How do we avoid the double reassembly?  I can think of three options.

  • Combine GTP and IPsec in the same network element:  Packets get fragmented only after both GTP and IPsec outbound processing are done.  Hence only two fragments are generated and the peer needs to do only one reassembly.  Note that this option may not be possible on the SGW side due to scalability reasons.  GTP and IPsec are always together in the eNodeB, though.
  • Configure the IPsec gateway to do reassembly before IPsec outbound processing:  Here GTP-U might have broken the packet into two.  The IPsec gateway reassembles them and passes the reassembled packet to IPsec outbound processing, after which it fragments the packet if necessary.  The peer (eNodeB) will only see two fragments and only one reassembly is required.  But note that this adds overhead in the IPsec gateway.  This may be okay, as the IPsec gateway is normally deployed as a separate element and its CPU might be entirely dedicated to this.
  • The better option, in my view, is to enable 'Red side fragmentation' in the IPsec gateway:  The 1500 byte GTP-U packet it receives gets fragmented before IPsec processing is done.  Three packets in total go through IPsec processing and are sent out; there is no further fragmentation.  On the eNodeB, the three fragments get reassembled after IPsec inbound processing and before GTP-U detunnel processing.  Only one reassembly is required.  The disadvantages are that it requires a 3-fragment reassembly on the eNodeB and that, on the IPsec gateway, more packets go through the IPsec engine.  Since many IPsec gateways use hardware accelerators, I expect this additional processing will not affect the overall throughput.
Comments?

Saturday, June 12, 2010

LTE Network Sharing (MOCN) - eNodeB

Typically, eNodeB and core network components are owned and operated by the same operator.  It is quite common to share physical infrastructure among multiple operators, specifically the cell sites - physical location, building etc.  But every operator used to have their own base stations, transport cards etc.  As I understand, that is called passive sharing.  This sharing is now being extended to active components such as the eNodeB.  That is called network sharing or active sharing.

Network sharing allows one frequency spectrum and one eNodeB to be shared by multiple operators.  Dedicated cells for each operator, while still sharing the rest of the E-UTRAN infrastructure, is also possible.

There are multiple business reasons for network sharing, the main one being cost savings:
  • Cost savings by sharing infrastructure - eNodeB and frequency spectrum.
    • LTE deployments require major investments, as new eNBs and new antennas need to be installed by operators.  Sharing reduces the number of eNBs each operator needs to own - in some cases, such as rural areas, providing connectivity while reducing costs.
The 3GPP body, as part of the LTE effort, visualized these scenarios and created specifications to handle network sharing.  3GPP specifications 22.951 and 23.251 describe the network sharing requirements and the architecture & functional description of network sharing.  The main feature that allows network sharing in LTE is that the eNodeB broadcasts multiple PLMN IDs (Public Land Mobile Network ID, a combination of Mobile Country Code and Mobile Network Code - each operator has a unique PLMN ID) to the UE using the SIB (System Information Block).  The UE is expected to select a PLMN ID based on its selection process.  Using the selected PLMN ID, the UE makes an RRC connection with the eNodeB.  The eNodeB uses this PLMN ID to select the core network and in turn the MME.

This feature of eNodeB serving multiple operator is also called 'Multi Operator Core Network' (MOCN).

Whenever hardware is shared by multiple operators, fairness comes into the picture.  Due to this feature, identification of contexts (whether PDCP, GTP, RLC, MAC, IPsec or QoS) will need one additional parameter - the operator ID.  Note that the TEID which is used to terminate GTP tunnels may not be unique across multiple operators.  In addition to this, I believe the following features would be expected from an eNodeB to support MOCN:
  • Additional identification parameter, Operator ID in user plane modules.
  • Fairness in allocating resources in eNodeB (Buffers,  Contexts etc.. ) and Radio resource management if same cell is used by multiple operators.
  • VLAN Support - One or few dedicated VLANs for each operator.  
  • DHCP Client (IPv4, IPv6) - multiple instances:  eNodeBs typically get their IP address from a DHCP server.  Since there are multiple VLANs due to multiple operators, the DHCP client also needs to be capable of getting multiple IP addresses - one for each operator (VLAN).
  • If other L3 connectivity protocols such as PPP are used instead of DHCP, then one needs to ensure that these protocols too get multiple IP addresses - one for each operator.
  • eNodeB should ensure that the right source IP addresses are used for GTP tunneling and IPsec tunneling.
  • Fairness to ensure that one operator's traffic does not overwhelm the CPU:
    • Radio bandwidth is normally taken care of as part of radio resource management on a per operator basis.
    • Incoming traffic from the backhaul network is expected to be policed at the ingress port.  Each operator VLAN may be configured to police the traffic, or to schedule the traffic coming from different VLANs to the CPU fairly (weighted, if configured).  Due to scheduling, packets may be pending in the queues awaiting future scheduling.  This can eat up buffers, and new packets may not get received.  So there should be limits on the number of buffers each VLAN can occupy at any time.
    • Outgoing traffic to the backhaul network also needs to be controlled on a per operator (VLAN) basis.  Note that all VLANs share the same physical link.  Hence the outgoing traffic needs to be controlled on a per VLAN basis to ensure that the physical link is not overwhelmed.  Traffic shaping and scheduling on a per VLAN basis is expected.  Within each VLAN, priority based queuing might also require traffic shaping & scheduling.  Hence, the eNodeB is expected to provide hierarchical shaping and scheduling.

Saturday, June 5, 2010

LTE PDCP from eNodeB perspective

Packet Data Convergence Protocol (PDCP) is one of the user plane protocols in LTE.  It is present in the UE and the eNodeB.  This protocol sends and receives packets between UE and eNodeB over the air interface.  It works along with the other L2 protocols, RLC (Radio Link Control) and MAC (Medium Access Control).

The PDCP layer works on top of RLC.  It transfers uplink packets to the GTP layer, which in turn tunnels the packets to the core network (Evolved Packet Core - EPC).  It receives downlink packets from the GTP layer and sends them to RLC, which in turn sends them to the UE.  That is, the PDCP layer sits between the RLC and GTP layers.

This particular post talks about PDCP layer details in eNodeB.  PDCP is user plane protocol. Control plane protocol RRC configures the PDCP entities in the user plane.

PDCP layer is described in 3GPP 36.323 standard.

PDCP functions : 

The PDCP layer is expected to do the following:

  • Security function over the air interface :  
    • Ciphering and Deciphering of user plane and control plane data.
    • Integrity protection and verification for control plane data:  Note that there is no integrity protection offered to the user plane data.
    • Sequence numbers are used to detect replays.
  • Header compression and decompression for user plane data:  Note that there is no header compression for control plane data.  RoHC (RFC 4995) is used to reduce the headers to save bandwidth on the air interface.  RoHC is mandatory for voice traffic.  Note that in LTE, both voice and data use packet switching.  Typically, for every 32 bytes of voice data, around 40 bytes of headers (RTP, UDP, IP) are added in case of IPv4 and around 60 bytes in case of IPv6.  That is quite a bit of overhead.  RoHC is expected to reduce the overhead to a few bytes.  For data traffic, RoHC is not mandatory, but it is good to have.
  • Handover:  As discussed in an earlier post, there are two types of handovers - seamless and lossless.  Seamless handover is typically used for radio bearers carrying control plane data and for user plane data that is mapped to RLC UM (Unacknowledged Mode).  In seamless handover, header compression contexts are reset and sequence number (COUNT) values are set to 0 in the target eNB.  PDCP SDUs that have not been transmitted are sent over the X2 interface (or the S1 interface in case there is no X2 connectivity) GTP tunnel to the target eNB.  Lossless handover is typically applied to radio bearers that are mapped to RLC AM.  In this handover mode too, the header compression context is reset; that is, the RoHC context is not transferred to the target eNB.  In lossless handover, pending downlink packets - PDCP SDUs for which no ACKs were received from the UE, PDCP SDUs which were not transmitted, and new GTP packets coming in from the S1 interface at the source eNB - are sent to the target eNB.  Similarly, uplink packets which were received out-of-order are also sent to the target eNB.  The control plane in the source eNB sends the 'next transmit sequence number' and 'next expected receive sequence number' to the target eNB for each RAB.  Optionally it also sends a bitmap of the PDCP sequence numbers of the packets which it expects the UE to retransmit.  This information is passed to the target eNB via SN-STATUS-TRANSFER.  I guess this information would be used by the target eNB PDCP to send the PDCP status report control message.
  • Discard function:  This allows packets to be discarded if the PDCP layer has not successfully sent them within the 'discard timeout' time.
  • Duplicate discarding:  If the PDCP layer receives duplicate packets (packets with the same sequence number), then it discards them and does not send them to upper layers.
Some points to note :

The PDCP specification goes to great lengths on PDCP data transfer procedures and details internal implementation aspects, such as the state variables to be maintained for receive and transmit operations.  These state variables are used to assign sequence numbers at transmit time and to verify/discard/deliver the packets which are received from the RLC layer.  I will not go into those details here as the specification is very clear on them.  One thing to note is that these procedures are described from the UE's perspective.  The same are valid for an eNB PDCP implementation too, but note that the UE PDCP transmits UL packets to RLC and receives DL packets from RLC, whereas in the eNodeB, the PDCP layer transmits DL packets to RLC and receives UL packets from RLC.  Keep this in mind while going through the spec document.

There are two kinds of PDCP bearers:  SRB (Signalling Radio Bearer) and DRB (Dedicated Radio Bearer).  There are only two SRBs - SRB1 and SRB2 - and they are used by the control plane protocol to send messages to the UE.  DRBs are used for sending voice and data; there are as many DRBs as there are QoS streams.

    There are two kinds of packets in PDCP - data packets and control packets.  Packets that are sent on SRBs and DRBs use the data packet format.  Control packets are used by ROHC to provide feedback from decompressors to compressors.  Control packets are also used by the PDCP layer to send PDCP sequence number status to the peer (the sequence numbers of packets that were received out-of-order).

    As discussed before, sequence numbers are sent along with the data packets so that the peer can do in-order delivery of packets to its user entity.  To preserve bandwidth on the air, only the least significant bits of the sequence number are sent.  The most significant bits are called the HFN (Hyper Frame Number).  Based on the window size, one of two sequence number sizes is chosen for the bits sent along with the packet - 7 bits (user plane short SN) or 12 bits (user plane long SN).  Typically the short SN is used for UM mode and the long SN for AM mode.
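For example, the full 32-bit COUNT used as ciphering input can be reconstructed from the HFN and the on-air SN.  A minimal sketch; the receive-side wrap handling here is simplified compared to the exact window rules of 36.323 section 5.1.2:

```c
#include <stdint.h>

/* COUNT = HFN in the most significant bits, on-air SN in the least
 * significant sn_bits (7 for short SN, 12 for long SN). */
static uint32_t pdcp_count(uint32_t hfn, uint32_t sn, unsigned sn_bits) {
    return (hfn << sn_bits) | (sn & ((1u << sn_bits) - 1u));
}

/* Receiver-side sketch: if the received SN is behind the next expected SN,
 * assume the SN wrapped and use HFN+1. */
static uint32_t pdcp_rx_count(uint32_t rx_hfn, uint32_t next_rx_sn,
                              uint32_t recv_sn, unsigned sn_bits) {
    if (recv_sn < next_rx_sn)
        rx_hfn++;
    return pdcp_count(rx_hfn, recv_sn, sn_bits);
}
```

Because only the low bits travel on the air, both ends must keep their HFN in sync; that is exactly why handover procedures exchange SN/HFN status.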

    There is one PDCP context per radio bearer.  A PDCP context is identified by a four-tuple - Virtual Instance ID, Sector ID, C-RNTI and LCI (logical channel identifier).  Please see this post for more details on virtual instance ID and sector ID.  Both sides of PDCP - RLC and GTP - would use the same identifiers to identify the PDCP context, hence a single search table is good enough (implementation note).
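An implementation might key that single lookup table as below.  This is a sketch; the key layout and the hash are illustrative, and a real eNB would likely use a stronger hash or a direct index.

```c
#include <stdint.h>

/* Hypothetical four-tuple key identifying one PDCP context; both the RLC
 * side and the GTP side can use the same key, so one table suffices. */
struct pdcp_key {
    uint8_t  vinst_id;   /* virtual instance id  */
    uint8_t  sector_id;
    uint16_t crnti;
    uint8_t  lci;        /* logical channel id   */
};

/* Simple illustrative hash over the four-tuple. */
static uint32_t pdcp_key_hash(const struct pdcp_key *k, uint32_t buckets) {
    uint32_t h = ((uint32_t)k->vinst_id << 24) ^ ((uint32_t)k->sector_id << 16)
               ^ ((uint32_t)k->crnti << 4) ^ k->lci;
    return h % buckets;
}
```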

    There is a one-to-one correspondence between a PDCP SDU and a PDCP PDU; that is, there are no segmentation or concatenation functions in the PDCP layer.  Adding the PDCP header and applying compression and security to the PDCP SDU makes the PDCP PDU.  Similarly, deciphering, decompression and removal of the PDCP header recovers the PDCP SDU from the PDCP PDU.
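The SDU-to-PDU transform for a DRB with 12-bit SN could be sketched as below.  Here rohc_compress() and cipher() are identity/toy stand-ins for the real algorithms; only the two header octets follow the 36.323 data PDU format for a 12-bit SN (D/C bit, reserved bits, SN).

```c
#include <stdint.h>
#include <string.h>

/* Identity stub standing in for ROHC compression. */
static size_t rohc_compress(const uint8_t *in, size_t n, uint8_t *out) {
    memcpy(out, in, n);
    return n;
}
/* Toy XOR stand-in for ciphering keyed by COUNT. */
static void cipher(uint8_t *buf, size_t n, uint32_t count) {
    for (size_t i = 0; i < n; i++) buf[i] ^= (uint8_t)(count + i);
}
/* One SDU in, exactly one PDU out: compress, cipher, prepend header. */
static size_t build_drb_pdu(const uint8_t *sdu, size_t sdu_len,
                            uint32_t count, uint8_t *pdu) {
    uint8_t tmp[2048];
    size_t n = rohc_compress(sdu, sdu_len, tmp);
    cipher(tmp, n, count);
    pdu[0] = 0x80 | ((count >> 8) & 0x0F);  /* D/C=1, R R R, SN[11:8] */
    pdu[1] = count & 0xFF;                  /* SN[7:0]                */
    memcpy(pdu + 2, tmp, n);
    return n + 2;
}
```

Note the ordering: compression runs on the plain IP packet before ciphering, and the header itself is sent in the clear.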

    A PDCP status report is expected to be generated as part of PDCP re-establishment if the RB is configured to send the status report.  This report is sent for PDCP PDUs that were received from RLC (uplink packets).  The fields to be sent in the status report are described in section 5.3.1 of the 36.323 spec.

    The PDCP layer in the eNB may also receive a PDCP status report from the UE indicating the out-of-order packets it has received.  The PDCP layer is expected to remove the PDCP SDUs that are pending in the transmit queue and were acknowledged by the peer via the status report.  It is also expected to pass the status report to the control plane, as the CP may need to forward it to the target eNB if that UE is in the handover stage.

    I always wondered how and when PDCP generates the status reports.  From the UL & DL data transfer procedures in section 5.1 of 36.323, one can observe that PDCP PDUs in the receive direction (from RLC) do not get stored in the PDCP layer.  They are given to the upper layers immediately after security and ROHC processing.  The PDCP layer assumes that packets are given in order by the RLC and hence does not need to store them to do in-order delivery to the upper layers.  The description of "PDCP Status Report" in section 5.3 of 36.323 says that the status report is sent for PDUs which were received out-of-order.  That gives the impression that packets are stored in the PDCP layer for in-order delivery to the upper layers - if not, how does the PDCP layer build the bitmap indicating the out-of-order packets?  So I thought sections 5.1 and 5.3 were contradictory.  Then came some enlightenment:  this only happens at PDCP re-establishment time.  When the control plane indicates re-establishment to the local RLC, the RLC layer sends the PDCP PDUs (RLC SDUs) it holds to the PDCP layer on an as-is basis; that is, there could be some missing packets at that time.  This is the only case where the PDCP layer gets packets with some PDCP PDUs missing.  The PDCP layer is also informed of the re-establishment by the control plane.  Upon receiving the packets from the RLC layer, PDCP is expected to send the status report to the peer with a bitmap of packet sequence numbers, so that the peer PDCP can remove from its transmit side the SDUs that were acknowledged in the status message.
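A receiver could derive the FMS (first missing SN) and the bitmap from a per-SN 'received' flag array during re-establishment roughly as below.  This is a simplified sketch of the status report construction (SN wrap-around is ignored, and received[] is a hypothetical bookkeeping array); the actual PDU format is in 36.323 section 6.2.6.

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

struct pdcp_status_report {
    uint16_t fms;             /* first missing PDCP SN               */
    uint8_t  bitmap[512];     /* one bit per SN after FMS, 1 = rcvd  */
    size_t   bitmap_len;      /* bytes actually used                 */
};

static void build_status_report(const uint8_t *received,
                                uint16_t first_sn, uint16_t last_sn,
                                struct pdcp_status_report *r) {
    uint16_t sn = first_sn;
    while (sn <= last_sn && received[sn]) sn++;   /* find first missing */
    r->fms = sn;
    memset(r->bitmap, 0, sizeof(r->bitmap));
    r->bitmap_len = 0;
    for (uint16_t i = (uint16_t)(sn + 1); i <= last_sn; i++) {
        size_t off = (size_t)(i - sn - 1);
        if (received[i]) r->bitmap[off / 8] |= (uint8_t)(0x80 >> (off % 8));
        r->bitmap_len = off / 8 + 1;
    }
}
```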

    I also had some confusion about PDCP discard for some time.  The PDCP specification (section 5.4 of 36.323) says that a timer is started for every packet that is submitted to the PDCP layer by the upper layers.  If there is no successful transmission acknowledgment from the local RLC for the packet within the 'discardTimer' timeout value, then PDCP can drop the packet.  For a while I wondered how the remote PDCP layer knows about this drop; I thought the remote peer would wait on that packet (sequence number) endlessly.  But from the UL/DL data transfer procedures, it is clear that the PDCP receiver does not wait on any missing sequence number - its window keeps moving right.
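The transmit-side discard can be as simple as a periodic sweep over the pending SDUs.  A sketch with hypothetical structures; a real implementation would more likely use a timer wheel or an ordered deadline queue:

```c
#include <stdint.h>

/* Each SDU gets a deadline (submit time + discardTimer) when it enters
 * PDCP; a periodic sweep drops SDUs that RLC has not acked in time. */
struct pending_sdu {
    int      in_use;
    int      acked;          /* set from the RLC ack callback */
    uint64_t deadline_ms;
};

static int sweep_discard(struct pending_sdu *q, int n, uint64_t now_ms) {
    int dropped = 0;
    for (int i = 0; i < n; i++) {
        if (q[i].in_use && !q[i].acked && now_ms >= q[i].deadline_ms) {
            q[i].in_use = 0;   /* drop; the receiver never waits on it */
            dropped++;
        }
    }
    return dropped;
}
```

Nothing is signalled to the peer on a drop, which is consistent with the receiver window simply moving past the missing SN.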

    PDCP Interfaces:
    The PDCP layer interfaces with three neighboring modules - RRC control plane, RLC and GTP.  Of course, there is also initialization, configuration, monitoring etc.  The following sections describe the interfaces with RRC, RLC and GTP.

    RRC to PDCP interface :  The following interfaces would need to be exposed by the PDCP user plane layer to the RRC in the control plane.
    •  Interface to create PDCP Contexts for SRB and DRB in PDCP layer :  Parameters for this function at high level are:
      • Virtual Instance ID,  Sector ID, C-RNTI (Cell - Radio network Temporary Identifier).
      • LCI (logical channel identifier).
      • Reference to the control plane:  Some opaque information to pass along with indications to the control plane.
      • SRBOrDRB boolean flag
      • If SRB:
        • Unacknowledged or Acknowledged Mode
        • If it is Unacknowledged, then the direction of the packets (Transmit only, Receive only or Both).  In acknowledged mode, it is assumed that the direction is always 'Both'.
        • Integrity Information (Y/N)
          • Algorithm 
          • Key
        • Cipher information (Y/N):
          • Algorithm
          • Key.
        • There is no RoHC for SRB.
        • Sequence number size is not configured for SRB by RRC. Use configuration 'Default SN Size for SRBs' to find the sequence number size.
      • If DRB:
        •  Active/Inactive:  Normally it is active, but in handover cases the target eNB might create the PDCP context as part of handover preparation and program the PDCP starting TX sequence number, expected receive sequence number and bitmap when the X2 protocol sends the SN_STATUS_TRANSFER message (refer to 36.300 Figure 10.1.2.1.1-1:  Intra MME/Serving Gateway HO).  Since these two events happen at different times, PDCP must not start processing packets until the PDCP sequence numbers are known to it.  To facilitate this, I believe the control plane first creates the PDCP context in 'Inactive' mode and activates it later using some other API function.  If PDCP receives SDUs from the upper layer while it is inactive, it is expected to hold them from processing until it is activated.
        • Unacknowledged or Acknowledged Mode
        • If it is Unacknowledged, then the direction of the packets (Transmit only, Receive only or Both).  In acknowledged mode, it is assumed that the direction is always 'Both'.
        • Integrity Information is not valid for DRB packets.
        • Cipher information (Y/N):
          • Algorithm
          • Key.
        • RoHC (Y/N)
          • Profile IDs:  Bit mask of compression profiles (RTP/UDP/IP,  UDP/IP,  ESP/IP, IP, TCP/IP , v2 RTP/UDP/IP, v2 UDP/IP, v2 ESP/IP and v2 IP)
          • maxCID:  Maximum flows.  
          • Large CID is derived from the max CID.  If maxCID > 15, large_cid is true else large_cid is false.
        • Sequence number size : 5 bits, 7 bits and 12 bits.
        • Handover case:  if this PDCP context is established in target eNB, it also sends the PDCP sequence numbers and Bit map of packets that were not received by source eNB.
    •  Interface to terminate PDCP contexts :  I am not sure whether there is any need to provide deletion of each individual bearer.  I have a feeling the control plane does not delete each one of them, so it is required to have a terminate function for all bearers belonging to a UE.  Parameters:
      • Virtual Instance ID,  Sector ID, C-RNTI
    • Interface to prepare PDCP context for re-establishment:  As part of this, PDCP is expected to wait for the packets sent by RLC that arrived out-of-order from the UE.  These packets would be processed and given to GTP-U.
    • Interface to set PDCP re-establishment on a per-SRB and per-DRB basis in the PDCP layer :  This interface function is expected to be called by the control plane when it requires re-establishment of a PDCP context.  This function is expected to send the PDCP status report message to the UE and also to set the cipher and integrity information in the context.  If there are any pending DL packets, they get retransmitted with the new cipher context.  Parameters:
      • Virtual Instance ID, Sector ID, C-RNTI, LCI to identify the bearer.
      •  Cipher Information (Y/N): As part of reestablishment new keys may be established.
        • Algorithm
        • Key
      • In case of SRB, integrity information (Y/N):
        • Algorithm
        • Key 
      • Please refer to the Section 5.2 of 36.323 to understand how to setup transmit and receive sequence number for different modes.
      • ROHC context is reset if it is applicable.
    •  Interface to indicate the handover of DRBs:  This function is expected to be called by the control plane in the source eNB as part of the handover execution phase.  When the PDCP layer gets this indication from the control plane, it should start forwarding the unacknowledged downlink packets and the uplink packets that were received out-of-order.  It is expected that this function is called by the control plane after it instructs RLC to re-establish.
    •  Interface to send control messages via SRBs:  SRBs are used by control plane. This function can be called by control plane to send the packets on SRB.
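The DRB-create parameters listed above could be collected into a single parameter block.  A hypothetical sketch - every field name here is mine, not taken from any spec - along with the large-CID derivation mentioned earlier:

```c
#include <stdint.h>

enum rlc_mode { RLC_UM, RLC_AM };
enum drb_dir  { DIR_TX_ONLY, DIR_RX_ONLY, DIR_BOTH };

/* Hypothetical parameter block for the RRC-to-PDCP DRB-create call. */
struct pdcp_drb_create {
    /* identity */
    uint8_t  vinst_id, sector_id, lci;
    uint16_t crnti;
    void    *rrc_ref;           /* opaque cookie echoed in indications   */
    /* behaviour */
    int      active;            /* 0 during HO preparation, set later    */
    enum rlc_mode mode;
    enum drb_dir  dir;          /* only meaningful for UM                */
    /* ciphering (no integrity protection for DRBs) */
    int      cipher_enabled;
    uint8_t  cipher_alg;
    uint8_t  cipher_key[16];
    /* ROHC */
    int      rohc_enabled;
    uint32_t rohc_profiles;     /* bit mask of compression profile IDs   */
    uint16_t rohc_max_cid;
    /* numbering */
    uint8_t  sn_size;           /* 7 or 12 bits for DRBs                 */
};

/* Derived setting mentioned above: large CID iff maxCID > 15. */
static int rohc_large_cid(uint16_t max_cid) { return max_cid > 15; }
```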
    PDCP to RRC interface: 
    •  Interface to inform RRC of a status report received from the peer : Using this interface point, PDCP informs the control plane of the contents of the status report message.  It sends the 'reference information' that was set in the PDCP context by the control plane during PDCP context creation; this helps the control plane correlate its context easily.  Information from this indication would be used by RRC during the handover execution phase.
    •  Interface to inform SRB data indications : This interface function gives the messages received on SRB from peer PDCP to the control plane.
    PDCP to RLC interface:
    • Interface to send PDCP PDUs, including PDCP control and data PDUs :  This interface point is needed to send a PDCP PDU to RLC.  RLC uses the same identification parameters to match its context as PDCP does.  Parameters include the bearer identification (Virtual Instance ID, Sector ID, C-RNTI, LCI), the PDCP PDU packet buffer and a message ID.  It is expected that the message ID is returned when RLC calls the ack function to report success or failure of delivery; this helps the PDCP implementation find the matching SDU, stop the discard timer and remove the entry.
    RLC to PDCP interface :
    • Interface to indicate new PDCP PDUs (new packets) - multiple of them :  This function can be used by RLC to give PDCP PDUs to the PDCP layer.  Multiple packets can be given at one time.  RLC might be buffering packets if they arrive out-of-order; when the missing packet arrives, all those packets can be given at once.
    • Interface to indicate the pending PDCP PDUs during re-establishment (multiple packets can be sent in one call) :  This function can be used by RLC to give the pending PDCP PDUs in the RLC to the PDCP layer.  Along with the last packet, it can indicate that it is the last packet.
    • Interface to indicate acknowledgment of PDUs sent using the PDCP to RLC interface functions :  This function is expected to be used by RLC to give a success/failure ack for the PDCP PDUs that were sent to RLC earlier.
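The message-ID correlation between the send and ack interface points could look like the following minimal sketch (all names hypothetical):  PDCP stores the SDU context under a message ID when handing the PDU to RLC, and looks it up again from the RLC ack callback.

```c
#include <stdint.h>

#define TXQ_SIZE 1024

struct tx_entry { int in_use; uint32_t pdcp_sn; };
static struct tx_entry txq[TXQ_SIZE];
static uint32_t next_msg_id;

/* Called when PDCP hands a PDU to RLC; returns the msg_id RLC echoes back. */
static uint32_t pdcp_tx_register(uint32_t pdcp_sn) {
    uint32_t id = next_msg_id++;
    txq[id % TXQ_SIZE] = (struct tx_entry){ 1, pdcp_sn };
    return id;
}

/* Called from the RLC ack callback. Returns the acked SN, or -1 if the
 * entry is unknown (e.g. already acked or discarded). */
static int64_t pdcp_tx_ack(uint32_t msg_id) {
    struct tx_entry *e = &txq[msg_id % TXQ_SIZE];
    if (!e->in_use) return -1;
    e->in_use = 0;
    return e->pdcp_sn;   /* caller stops the discard timer for this SN */
}
```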
    PDCP to GTP interface:
    • Interface to send PDCP SDUs to the GTP : This function is used by PDCP to give PDCP SDU (IP packet) to the GTP layer.
    • Interface to send downlink forwarding packets (upon handover) to the GTP layer & interface to send uplink forwarding packets (upon handover) to the GTP layer:  These functions are called when the PDCP layer is informed of a handover for a given context.  Both UL and DL packets are sent to GTP along with their sequence numbers.  The last packet is expected to be indicated explicitly.  Since GTP waits for the last-packet indication, it is necessary that GTP gets a 'no more packets' indication even if there are no packets to forward to the target eNB.
    GTP to PDCP interface:
    • Interface for the GTP layer to send new packets to the PDCP layer in the downlink direction :  This function is called by the GTP layer (PDCP's upper layer) to send packets to PDCP.
    • Interface for the GTP layer to send DL-forwarded packets (during handover) & interface for the GTP layer to send UL-forwarded packets (during handover) : These functions are normally called in the target eNB during the handover execution phase.  These packets are sent to the PDCP layer along with their sequence numbers.
    I have written this description based on my understanding of the PDCP specifications.  If somebody finds any inconsistency or if I have made any mistakes, please drop a comment or send an email.