CSE791 IoT: Notes for Apr. 10, 2020

Character devices

scull drivers

scull” is short for “Simple Character Utility for Loading Localities”. It’s a collection of sample character devices provided by Linux. All the buffer of scull is stored in main memory. There are the following types of scull device:

These scull drivers are basically Linux’s tutorial to teach us how to write drivers for different types of devices.

Major & Minor Numbers

When we list the devices nodes in /dev folder, in the size field, we are getting two numbers separated by a comma. These are actually unique identifiers for each device. The first number is called major number, the second number is called minor number.

(base) user@machine:/dev$ ll
total 4
drwxr-xr-x  22 root root          5220 Apr  8 17:41 ./
drwxr-xr-x  26 root root          4096 Apr  8 17:41 ../
crw-rw-rw-   1 root root       10,  56 Apr  8 17:41 ashmem
crw-r--r--   1 root root       10, 235 Apr  8 17:41 autofs
crw-rw-rw-   1 root root      511,   0 Apr  8 17:41 binder

Major number

Minor number

Major/minor number in kernel programming

// In Linux, dev_t is a 32-bit integer, where
// 12 bits are used for major number and 20 bits
// are used for minor number

// getting major/minor numbers from dev_t
MAJOR(dev_t dev);
MINOR(dev_t dev);

// combining major and minor numbers into dev_t
MKDEV(int major, int minor);

Allocating device numbers

Static allocation

In static allocation, you specify the major number.

int register_chrdev_region (
  dev_t  	from,
 	unsigned  	count,
 	const char *  	name);

Dynamic allocation

In dynamic allocation, the major number will be dynamically allocated.

int alloc_chrdev_region (
  dev_t *  	dev,
 	unsigned  	baseminor,
 	unsigned  	count,
 	const char *  	name);

Registering device nodes

After calling alloc_chrdev_region, the device name and major number will appear in /proc/devices.

(base) user@machine:/proc$ cat devices
Character devices:
  1 mem
  4 /dev/vc/0
  4 tty
  4 ttyS
  5 /dev/tty
  5 /dev/console
  5 /dev/ptmx
  5 ttyprintk
  6 lp
  7 vcs
 10 misc
 13 input
 21 sg
 29 fb
 81 video4linux

Block devices:
  7 loop
  8 sd
  9 md
 11 sr
 65 sd
 66 sd
 67 sd
 68 sd
 69 sd
 70 sd

Now, we need to register a file system node for the device. It can be done by using a device loading script, which usually looks like the this:

#!/bin/sh
module="scull"
device="scull"
mode="664"

# invoke insmod with all arguments we got
# and use a pathname, as newer modutils don't look in . by default
/sbin/insmod ./$module.ko $* || exit 1

# remove stale nodes
rm -f /dev/${device}[0-3]

major=$(awk "\\$2= =\"$module\" {print \\$1}" /proc/devices)

mknod /dev/${device}0 c $major 0
mknod /dev/${device}1 c $major 1
mknod /dev/${device}2 c $major 2
mknod /dev/${device}3 c $major 3

# give appropriate group/permissions, and change the group.
# Not all distributions have staff, some have "wheel" instead.
group="staff"
grep -q '^staff:' /etc/group || group="wheel"

chgrp $group /dev/${device}[0-3]
chmod $mode /dev/${device}[0-3]

Miscellaneous device

Linux provide a template to create simple character devices, which are called “miscellaneous devices”. The TUN/TAP driver is both a network interface and a misc device. The major number of all misc devices is 10.

static struct miscdevice tun_miscdev = {
	.minor = TUN_MINOR,
	.name = "tun",
	.nodename = "net/tun",
	.fops = &tun_fops,
};

More information on misc device:

Important data structures

File operations

The file_operations structure registers methods for device operations. These are some important members of this structure:

More on file_operations:

The iov_iter interface

In kernel programming, one often needs to process buffers supplied by user space. If one writes this function directly, it is really easy to get something wrong. Therefore, the kernel provides iov_iter interface to simplify this process.

An iov_iter is essentially an iterator for iovec structure, which is defined as:

struct iovec
{
	void __user *iov_base;
	__kernel_size_t iov_len;
};

This structure matches the user-space iovec structure used by system calls like readv and writev.

ssize_t readv(int fd, const struct iovec *iov, int iovcnt);

The readv() system call reads iovcnt buffers from the file associated with the file descriptor fd into the buffers described by iov (“scatter input”).

ssize_t writev(int fd, const struct iovec *iov, int iovcnt);

The writev() system call writes iovcnt buffers of data described by iov to the file associated with the file descriptor fd (“gather output”).

The definition of iov_iter:

struct iov_iter {
	int type;
	size_t iov_offset;
	size_t count;
	const struct iovec *iov;
	unsigned long nr_segs;
};
  • type: properties of this iterator stored as bit mask. e.g. READ/WRITE, etc.
  • iov_offset: offset of cursor w.r.t. to first iovec pointed by iov
  • count: total amount of data pointed to by iov
  • iov: an array of iovec objects (buffers)
  • nr_segs: number of arrays in iov

To transfer data between kernel space and user space, one can call

size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i);
size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i);

Here addr is address in kernel space; bytes is size of buffer and i is the user-space buffer indicated by the iterator.

The file_operations also provides read_iter and write_iter options to allow a device to interact with user-space buffers using iovec. This is how packets are shipped from kernel space to user space.

TUN’s file_operations structure:

static const struct file_operations tun_fops = {
	.owner	= THIS_MODULE,
	.llseek = no_llseek,
	.read_iter  = tun_chr_read_iter,
	.write_iter = tun_chr_write_iter,
	.poll	= tun_chr_poll,
	.unlocked_ioctl	= tun_chr_ioctl,
#ifdef CONFIG_COMPAT
	.compat_ioctl = tun_chr_compat_ioctl,
#endif
	.open	= tun_chr_open,
	.release = tun_chr_close,
	.fasync = tun_chr_fasync,
#ifdef CONFIG_PROC_FS
	.show_fdinfo = tun_chr_show_fdinfo,
#endif
};

More on iov_iter:

TUN’s read implementation

static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
			   struct iov_iter *to,
			   int noblock, void *ptr)
{
	ssize_t ret;
	int err;

	if (!iov_iter_count(to)) {
		tun_ptr_free(ptr);
		return 0;
	}

	if (!ptr) {
		/* Read frames from ring */
		ptr = tun_ring_recv(tfile, noblock, &err);
		if (!ptr)
			return err;
	}

	if (tun_is_xdp_frame(ptr)) {
		struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);

		ret = tun_put_user_xdp(tun, tfile, xdpf, to);
		xdp_return_frame(xdpf);
	} else {
		struct sk_buff *skb = ptr;

		ret = tun_put_user(tun, tfile, skb, to);
		if (unlikely(ret < 0))
			kfree_skb(skb);
		else
			consume_skb(skb);
	}

	return ret;
}

static ssize_t tun_chr_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
	struct file *file = iocb->ki_filp;
	struct tun_file *tfile = file->private_data;
	struct tun_struct *tun = tun_get(tfile);
	ssize_t len = iov_iter_count(to), ret;

	if (!tun)
		return -EBADFD;
	ret = tun_do_read(tun, tfile, to, file->f_flags & O_NONBLOCK, NULL);
	ret = min_t(ssize_t, ret, len);
	if (ret > 0)
		iocb->ki_pos = ret;
	tun_put(tun);
	return ret;
}

TUN’s write implementation

/* Get packet from user space buffer */
static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
			    void *msg_control, struct iov_iter *from,
			    int noblock, bool more)
{
	struct tun_pi pi = { 0, cpu_to_be16(ETH_P_IP) };
	struct sk_buff *skb;
	size_t total_len = iov_iter_count(from);
	size_t len = total_len, align = tun->align, linear;
	struct virtio_net_hdr gso = { 0 };
	struct tun_pcpu_stats *stats;
	int good_linear;
	int copylen;
	bool zerocopy = false;
	int err;
	u32 rxhash = 0;
	int skb_xdp = 1;
	bool frags = tun_napi_frags_enabled(tfile);

	if (!(tun->flags & IFF_NO_PI)) {
		if (len < sizeof(pi))
			return -EINVAL;
		len -= sizeof(pi);

		if (!copy_from_iter_full(&pi, sizeof(pi), from))
			return -EFAULT;
	}

	if (tun->flags & IFF_VNET_HDR) {
		int vnet_hdr_sz = READ_ONCE(tun->vnet_hdr_sz);

		if (len < vnet_hdr_sz)
			return -EINVAL;
		len -= vnet_hdr_sz;

		if (!copy_from_iter_full(&gso, sizeof(gso), from))
			return -EFAULT;

		if ((gso.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
		    tun16_to_cpu(tun, gso.csum_start) + tun16_to_cpu(tun, gso.csum_offset) + 2 > tun16_to_cpu(tun, gso.hdr_len))
			gso.hdr_len = cpu_to_tun16(tun, tun16_to_cpu(tun, gso.csum_start) + tun16_to_cpu(tun, gso.csum_offset) + 2);

		if (tun16_to_cpu(tun, gso.hdr_len) > len)
			return -EINVAL;
		iov_iter_advance(from, vnet_hdr_sz - sizeof(gso));
	}

	if ((tun->flags & TUN_TYPE_MASK) == IFF_TAP) {
		align += NET_IP_ALIGN;
		if (unlikely(len < ETH_HLEN ||
			     (gso.hdr_len && tun16_to_cpu(tun, gso.hdr_len) < ETH_HLEN)))
			return -EINVAL;
	}

	good_linear = SKB_MAX_HEAD(align);

	if (msg_control) {
		struct iov_iter i = *from;

		/* There are 256 bytes to be copied in skb, so there is
		 * enough room for skb expand head in case it is used.
		 * The rest of the buffer is mapped from userspace.
		 */
		copylen = gso.hdr_len ? tun16_to_cpu(tun, gso.hdr_len) : GOODCOPY_LEN;
		if (copylen > good_linear)
			copylen = good_linear;
		linear = copylen;
		iov_iter_advance(&i, copylen);
		if (iov_iter_npages(&i, INT_MAX) <= MAX_SKB_FRAGS)
			zerocopy = true;
	}

	if (!frags && tun_can_build_skb(tun, tfile, len, noblock, zerocopy)) {
		/* For the packet that is not easy to be processed
		 * (e.g gso or jumbo packet), we will do it at after
		 * skb was created with generic XDP routine.
		 */
		skb = tun_build_skb(tun, tfile, from, &gso, len, &skb_xdp);
		if (IS_ERR(skb)) {
			this_cpu_inc(tun->pcpu_stats->rx_dropped);
			return PTR_ERR(skb);
		}
		if (!skb)
			return total_len;
	} else {
		if (!zerocopy) {
			copylen = len;
			if (tun16_to_cpu(tun, gso.hdr_len) > good_linear)
				linear = good_linear;
			else
				linear = tun16_to_cpu(tun, gso.hdr_len);
		}

		if (frags) {
			mutex_lock(&tfile->napi_mutex);
			skb = tun_napi_alloc_frags(tfile, copylen, from);
			/* tun_napi_alloc_frags() enforces a layout for the skb.
			 * If zerocopy is enabled, then this layout will be
			 * overwritten by zerocopy_sg_from_iter().
			 */
			zerocopy = false;
		} else {
			skb = tun_alloc_skb(tfile, align, copylen, linear,
					    noblock);
		}

		if (IS_ERR(skb)) {
			if (PTR_ERR(skb) != -EAGAIN)
				this_cpu_inc(tun->pcpu_stats->rx_dropped);
			if (frags)
				mutex_unlock(&tfile->napi_mutex);
			return PTR_ERR(skb);
		}

		if (zerocopy)
			err = zerocopy_sg_from_iter(skb, from);
		else
			err = skb_copy_datagram_from_iter(skb, 0, from, len);

		if (err) {
			err = -EFAULT;
drop:
			this_cpu_inc(tun->pcpu_stats->rx_dropped);
			kfree_skb(skb);
			if (frags) {
				tfile->napi.skb = NULL;
				mutex_unlock(&tfile->napi_mutex);
			}

			return err;
		}
	}

	if (virtio_net_hdr_to_skb(skb, &gso, tun_is_little_endian(tun))) {
		this_cpu_inc(tun->pcpu_stats->rx_frame_errors);
		kfree_skb(skb);
		if (frags) {
			tfile->napi.skb = NULL;
			mutex_unlock(&tfile->napi_mutex);
		}

		return -EINVAL;
	}

	switch (tun->flags & TUN_TYPE_MASK) {
	case IFF_TUN:
		if (tun->flags & IFF_NO_PI) {
			u8 ip_version = skb->len ? (skb->data[0] >> 4) : 0;

			switch (ip_version) {
			case 4:
				pi.proto = htons(ETH_P_IP);
				break;
			case 6:
				pi.proto = htons(ETH_P_IPV6);
				break;
			default:
				this_cpu_inc(tun->pcpu_stats->rx_dropped);
				kfree_skb(skb);
				return -EINVAL;
			}
		}

		skb_reset_mac_header(skb);
		skb->protocol = pi.proto;
		skb->dev = tun->dev;
		break;
	case IFF_TAP:
		if (!frags)
			skb->protocol = eth_type_trans(skb, tun->dev);
		break;
	}

	/* copy skb_ubuf_info for callback when skb has no error */
	if (zerocopy) {
		skb_shinfo(skb)->destructor_arg = msg_control;
		skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
		skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
	} else if (msg_control) {
		struct ubuf_info *uarg = msg_control;
		uarg->callback(uarg, false);
	}

	skb_reset_network_header(skb);
	skb_probe_transport_header(skb);

	if (skb_xdp) {
		struct bpf_prog *xdp_prog;
		int ret;

		local_bh_disable();
		rcu_read_lock();
		xdp_prog = rcu_dereference(tun->xdp_prog);
		if (xdp_prog) {
			ret = do_xdp_generic(xdp_prog, skb);
			if (ret != XDP_PASS) {
				rcu_read_unlock();
				local_bh_enable();
				if (frags) {
					tfile->napi.skb = NULL;
					mutex_unlock(&tfile->napi_mutex);
				}
				return total_len;
			}
		}
		rcu_read_unlock();
		local_bh_enable();
	}

	/* Compute the costly rx hash only if needed for flow updates.
	 * We may get a very small possibility of OOO during switching, not
	 * worth to optimize.
	 */
	if (!rcu_access_pointer(tun->steering_prog) && tun->numqueues > 1 &&
	    !tfile->detached)
		rxhash = __skb_get_hash_symmetric(skb);

	rcu_read_lock();
	if (unlikely(!(tun->dev->flags & IFF_UP))) {
		err = -EIO;
		rcu_read_unlock();
		goto drop;
	}

	if (frags) {
		/* Exercise flow dissector code path. */
		u32 headlen = eth_get_headlen(tun->dev, skb->data,
					      skb_headlen(skb));

		if (unlikely(headlen > skb_headlen(skb))) {
			this_cpu_inc(tun->pcpu_stats->rx_dropped);
			napi_free_frags(&tfile->napi);
			rcu_read_unlock();
			mutex_unlock(&tfile->napi_mutex);
			WARN_ON(1);
			return -ENOMEM;
		}

		local_bh_disable();
		napi_gro_frags(&tfile->napi);
		local_bh_enable();
		mutex_unlock(&tfile->napi_mutex);
	} else if (tfile->napi_enabled) {
		struct sk_buff_head *queue = &tfile->sk.sk_write_queue;
		int queue_len;

		spin_lock_bh(&queue->lock);
		__skb_queue_tail(queue, skb);
		queue_len = skb_queue_len(queue);
		spin_unlock(&queue->lock);

		if (!more || queue_len > NAPI_POLL_WEIGHT)
			napi_schedule(&tfile->napi);

		local_bh_enable();
	} else if (!IS_ENABLED(CONFIG_4KSTACKS)) {
		tun_rx_batched(tun, tfile, skb, more);
	} else {
		netif_rx_ni(skb);
	}
	rcu_read_unlock();

	stats = get_cpu_ptr(tun->pcpu_stats);
	u64_stats_update_begin(&stats->syncp);
	u64_stats_inc(&stats->rx_packets);
	u64_stats_add(&stats->rx_bytes, len);
	u64_stats_update_end(&stats->syncp);
	put_cpu_ptr(stats);

	if (rxhash)
		tun_flow_update(tun, rxhash, tfile);

	return total_len;
}

static ssize_t tun_chr_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
	struct file *file = iocb->ki_filp;
	struct tun_file *tfile = file->private_data;
	struct tun_struct *tun = tun_get(tfile);
	ssize_t result;

	if (!tun)
		return -EBADFD;

	result = tun_get_user(tun, tfile, NULL, from,
			      file->f_flags & O_NONBLOCK, false);

	tun_put(tun);
	return result;
}

Resources