As Raj mentioned, we have completed the integration of pools and PFS. Here is detailed documentation of our work. Anyone who is still not clear about our work can take a look at this document :) Click here to download.
Like the first step of any open source software, our filesystem has been put up on SourceForge, and we are waiting for more people to get involved in building it into a next-generation filesystem for Linux.
It's been a long time since I blogged. Now here it comes...
We never expected to finish this fast.
A very basic file system is done. We can now dynamically add disks into the pool.
The removing part is still pending. For that we need to defragment the hard disk and so on... That job is postponed to next month.
I started with minix_fs. It is a simple filesystem. I also wrote PFS-specific bit operations (in pfs_bitops.c). After going through the minix code some 20 times, I understood all the functions in itree_common.c. It was difficult, though. Luckily, I didn't need to change these functions; I copied them as they were. We made a lot of changes in dir.c... Everything went like this. Changing things was no big problem. Everything was finished by the 20th of March.
It's been a while since the blog was updated. We lost focus for a bit. Got busy with other things. Anyway... the project is back on track. We made quite a lot of progress. We got the project approved at SourceForge. Once we have a minimal usable version, we'll put it there.
As for the details of the project: all kinds of verification have been added. Pools with the same name cannot be created. The same disk cannot be added to multiple pools. Pools can be renamed and deleted. We made sure that pools always have at least one disk. It's pretty stable code now.
Since the pool is a virtual, RAM-based disk that just maps all the requests, it used to disappear every time the system was rebooted or the kernel module was reloaded. So we now put metadata into each disk that is part of a pool. I was allowed to use the first 1024 bytes (i.e. the boot block) for metadata.
The initial plan was to store details of all disks and all pools in every participating disk, so we had thought of limiting the maximum number of pools and the number of disks in each pool. Then we reduced the metadata so that each disk stores details of only the disks in its own pool. Though that allowed an "infinite" number of pools, we still had to limit the number of disks in each pool.
Finally, after much thought, we reduced the metadata to this: signature, pool name, rank, random number, total disks, checksum.
- Signature: a string that just checks whether the disk is part of any pool. We have used the word "pfs-inodz" as the signature. ;-)
- Pool name: specifies which pool the disk belongs to.
- Rank: says at what position the disk appears in the pool.
- Random number: fixes an issue. Say two pools are created with the same name on two different computers and their disks are then connected to the same computer. PFS could go crazy, but the random number saves the day. It is the same for every disk in a pool.
- Total disks: makes sure that all the disks in the pool are present and thus the pool is valid.
- Checksum: checks whether the metadata is valid.
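To make the six fields concrete, here is a minimal sketch of how such a record could look in C. Only the field list and the "pfs-inodz" signature come from our design; the field widths, the names (pfs_disk_meta, pool_id and so on) and the checksum algorithm are assumptions made up for this illustration, and byte order is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PFS_SIGNATURE "pfs-inodz"   /* signature string from the post */
#define PFS_NAME_LEN  32            /* assumed limit on the pool name */

/* Hypothetical on-disk record; it has to fit in the 1024-byte boot block. */
struct pfs_disk_meta {
    char     signature[16];           /* "pfs-inodz" if the disk is pooled */
    char     pool_name[PFS_NAME_LEN]; /* pool this disk belongs to */
    uint32_t rank;                    /* position of the disk in the pool */
    uint32_t pool_id;                 /* random number shared by the pool */
    uint32_t total_disks;             /* disks the pool expects to find */
    uint32_t checksum;                /* covers every field above it */
};

/* Assumed checksum: a simple rolling sum over the bytes before `checksum`. */
static uint32_t pfs_meta_checksum(const struct pfs_disk_meta *m)
{
    const unsigned char *p = (const unsigned char *)m;
    size_t n = offsetof(struct pfs_disk_meta, checksum);
    uint32_t sum = 0;

    for (size_t i = 0; i < n; i++)
        sum = sum * 31 + p[i];
    return sum;
}

/* Fills a record for one disk and seals it with the checksum. */
static void pfs_meta_init(struct pfs_disk_meta *m, const char *pool,
                          uint32_t rank, uint32_t pool_id, uint32_t total)
{
    assert(m != NULL && pool != NULL);
    memset(m, 0, sizeof(*m));
    strncpy(m->signature, PFS_SIGNATURE, sizeof(m->signature) - 1);
    strncpy(m->pool_name, pool, sizeof(m->pool_name) - 1);
    m->rank = rank;
    m->pool_id = pool_id;
    m->total_disks = total;
    m->checksum = pfs_meta_checksum(m);
}

/* A disk is a pool member only if the signature and checksum both match. */
static int pfs_meta_valid(const struct pfs_disk_meta *m)
{
    return strncmp(m->signature, PFS_SIGNATURE, sizeof(m->signature)) == 0
        && m->checksum == pfs_meta_checksum(m);
}
```

With these widths the record is 64 bytes, so it fits easily in the boot block. Any change to a field after sealing makes pfs_meta_valid() fail.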
Then another mad problem was to find a way to iterate through all the block_devices connected to the system. The internal list maintained by the kernel, all_bdevs, is not exported and so is not accessible. So the ioctl guy, Santosh, wrote a program that gets the list of all devices from the output of the command "fdisk -l" and passes it to the module when it is inserted.
Now, with that list, I could read the metadata from all participating disks and recreate the pools every time. TADAAA!!! Done.
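The check that decides whether a pool can be recreated could look roughly like the sketch below. It is a user-space illustration, not our module code: the scanned_disk structure, the function name, the 256-disk cap and the assumption that ranks run from 0 to total_disks - 1 are all made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_DISKS 256   /* assumed cap on disks per pool */

/* Hypothetical in-memory copy of the metadata read from one disk. */
struct scanned_disk {
    char     pool_name[32];
    uint32_t pool_id;       /* the shared random number */
    uint32_t rank;          /* assumed to run from 0 to total_disks - 1 */
    uint32_t total_disks;
};

/*
 * Returns true when every rank of the pool identified by (name, pool_id)
 * was found exactly once among the n scanned disks.
 */
static bool pool_is_complete(const struct scanned_disk *d, size_t n,
                             const char *name, uint32_t pool_id)
{
    bool seen[MAX_DISKS] = { false };
    uint32_t total = 0, found = 0;

    assert(d != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        /* Same name but a different random number means another pool. */
        if (d[i].pool_id != pool_id || strcmp(d[i].pool_name, name) != 0)
            continue;
        if (total == 0)
            total = d[i].total_disks;
        if (d[i].total_disks != total || total > MAX_DISKS ||
            d[i].rank >= total || seen[d[i].rank])
            return false;   /* inconsistent or duplicate metadata */
        seen[d[i].rank] = true;
        found++;
    }
    return total != 0 && found == total;
}
```

Matching on both the name and the random number is what keeps two same-named pools from different computers apart.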
Meanwhile, Rajgopal was looking through the minix code. By the way, that's the file system we have decided to use as a base to work on (because of its comparative simplicity, of course). He made some progress there. Apparently, it is easier than we thought it would be. Modifying some 10 functions which handle the superblock should do the job.
The file system needs a lot of data from the pools. So we decided to integrate the two and run it as one single kernel module. We are integrating our pool code with the minix2 code now.
Anyway, instead of me blabbering... he knows more about the file system now. I'll ask him to post.
At least it looks as though we'll be able to finish the project well within the deadline. Fingers crossed.
With the level of confusion and cluelessness we had at the beginning of the project, we never expected it would pick up such a pace. Guess that's how it is when it comes to Linux kernel hacking. Unbelievably interesting work to do. Anyway, coming to the progress of the project: we are still working on the pools. We are done with, let's say, 85-90% of it. There were features and bugs which took much longer than we expected. Since the last blog update, we added these features successfully:
-Adding block devices to pools
-Removing block devices from pools
-Creating pools via IOCTL commands
-Multiple number of pools
-Names to pools
Adding and removing were pretty straightforward. We got stuck when we needed to write code that returns a block_device given the name of the device. We finally used the function open_by_devnum(). Later, when our job with the block_devices is over (exiting the module, deleting disks from a pool, etc.), we use the function blkdev_put() to close the devices.
Then, for creating pools, we decided to have an initial pool called /dev/pool which would receive and execute all requests (IOCTL) to create pools.
-Writing metadata having information about all pools and constituent block devices to pools.
-Scanning all disks at startup to create pools.
-Minor work such as renaming and deleting pools.
-IOCTL commands to list all pools and their member block devices.
With those done, the work on the pools would be complete. Hopefully in another week. We'll probably put the code for the pool on the blog after that.
Then we have to work on the filesystem. A lot of study has to be done. We have no clear direction yet. We expect to get stuck for a long time. As he said... GOD HELP US WITH THAT.
This project is going really great. After the one-to-one mapping, the next plan was to build a one-to-many mapping.
The pool disk will be the one on which the file system resides. This pool device internally maps to the block devices of the original hard disks.
Krishna and I worked hard like never before.
The one-to-many mapping consists of these steps.
- Get the bio
- Find in which disk the bio starts.
- If the bio starts and ends in the same device, change the block_dev and sector and send the bio.
- Else, if the bio spans many devices, split the bio and call this same function recursively.
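The steps above can be sketched in plain user-space C as below. This is only a model of the arithmetic, assuming the member disks are simply concatenated in pool order; the real module works on struct bio, and the names here (member_disk, io_piece, pool_map) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct member_disk { uint64_t sectors; };  /* size of one pooled disk */

/* One piece of a request after mapping: which disk, where, how long. */
struct io_piece { size_t disk; uint64_t sector; uint64_t count; };

/*
 * Maps pool sectors [start, start + count) onto the member disks.
 * Returns the number of pieces written to out, or -1 if the request
 * runs past the end of the pool or out has too little room.
 */
static int pool_map(const struct member_disk *disks, size_t n_disks,
                    uint64_t start, uint64_t count,
                    struct io_piece *out, size_t max_out)
{
    uint64_t base = 0;   /* first pool sector of the current disk */
    size_t i;

    assert(disks != NULL && out != NULL);
    if (count == 0)
        return 0;

    /* Find the disk in which the request starts. */
    for (i = 0; i < n_disks; i++) {
        if (start < base + disks[i].sectors)
            break;
        base += disks[i].sectors;
    }
    if (i == n_disks || max_out == 0)
        return -1;

    uint64_t offset = start - base;              /* sector inside disk i */
    uint64_t room = disks[i].sectors - offset;   /* sectors left on disk i */

    out->disk = i;
    out->sector = offset;
    if (count <= room) {          /* starts and ends in the same disk */
        out->count = count;
        return 1;
    }

    /* Spans several disks: keep the first part, recurse on the rest. */
    out->count = room;
    int rest = pool_map(disks, n_disks, start + room, count - room,
                        out + 1, max_out - 1);
    return rest < 0 ? -1 : rest + 1;
}
```

For a pool made of disks with 100, 50 and 200 sectors, a request for 80 sectors starting at pool sector 90 comes out as three pieces: 10 sectors at the end of the first disk, all 50 of the second, and the first 20 of the third.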
While in development, this module by default spans only /dev/ram0 and /dev/ram1. But we need the user to select the devices.
Here comes ioctl.
The usual way for user programs to send commands to a driver like ours is through IOCTL calls. Krishna and I now started concentrating on writing ioctls. We wrote the ioctl handling inside the driver. Santhosh started working on the user commands. He writes C programs that take commands. Commands look something like:
$ pool add /dev/pool0 /dev/ram10 /dev/ram12
This means: add /dev/ram10 and /dev/ram12 to /dev/pool0. His program gets the devices from the command-line arguments and puts them in a data structure. Finally, according to what is to be done (add, remove, etc.), it passes the corresponding ioctl command number along with the data structure.
We get this data structure inside the driver and add the corresponding block_device objects to our list of devices handled by the pool. The same applies to remove.
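Here is a rough sketch of the user-side half of that flow: turning the command line into a data structure plus a command. Everything named here is hypothetical (pool_ioctl_arg, the POOL_CMD_* values, the size limits, and the "remove" subcommand spelling); the real tool and its ioctl numbers may differ, and a real driver would normally define its command numbers with the _IOW() family of macros.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POOL_MAX_DEVS 16   /* assumed limit on devices per command */
#define POOL_PATH_LEN 64   /* assumed limit on a device path */

/* Hypothetical commands, standing in for the real ioctl command numbers. */
enum pool_cmd { POOL_CMD_INVALID = 0, POOL_CMD_ADD, POOL_CMD_REMOVE };

/* Hypothetical structure that would be handed to ioctl(). */
struct pool_ioctl_arg {
    int  n_devs;
    char devs[POOL_MAX_DEVS][POOL_PATH_LEN];
};

/*
 * Parses "pool <add|remove> <pool device> <device>..." into arg.
 * Returns the command and stores the pool device path in *pool_path.
 */
static enum pool_cmd pool_parse(int argc, char *const argv[],
                                const char **pool_path,
                                struct pool_ioctl_arg *arg)
{
    enum pool_cmd cmd;

    assert(argv != NULL && pool_path != NULL && arg != NULL);
    if (argc < 4 || argc - 3 > POOL_MAX_DEVS)
        return POOL_CMD_INVALID;

    if (strcmp(argv[1], "add") == 0)
        cmd = POOL_CMD_ADD;
    else if (strcmp(argv[1], "remove") == 0)
        cmd = POOL_CMD_REMOVE;
    else
        return POOL_CMD_INVALID;

    memset(arg, 0, sizeof(*arg));
    for (int i = 3; i < argc; i++) {
        if (strlen(argv[i]) >= POOL_PATH_LEN)
            return POOL_CMD_INVALID;   /* path would not fit */
        strcpy(arg->devs[arg->n_devs++], argv[i]);
    }
    *pool_path = argv[2];
    return cmd;
}
```

After parsing, the program would open() the pool device and pass the structure to ioctl() with the matching command number. The driver side then looks up a block_device for each path in devs.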
Some ioctl operations are left. We are working on them.
Our next step will be putting metadata on the hard disks. We know nothing about this yet. Lots and lots of problems. As soon as we are done, I'll post about it here!
We've not yet touched the file system part. Problems lie there too...
GOD HELP US! :)
- one for physical on_disk pool. (when the pool info is saved in disks, we should take care of : 'this' data is from 'this' sector to 'this' sector. )
- one for memory (take information from physical disks and put it in the pool object in memory which is handy for programming.)
- How can a disk (partition or physical disk or whatever) be uniquely identified? The device address (like /dev/sda) might change if you reboot.
- When the computer is rebooted, our pool should know which disks it was handling before. For this we need to store the pool-device relationship somewhere permanently (you can't store it only in a disk which is participating in the pool, because that disk can be removed). Where can I save that?
- (!) A lot of things become easy if we keep a maximum limit on the number of disks that can participate in the pool (say 256). Maybe there are ways to handle an infinite number of disks, but to begin with, the first version will have 256 as the limit.
The pool is the device on which the file system is going to operate. So the pool has to redirect the requests it gets to the original devices below it. I was a bit stuck on "how can this communication be achieved?"...
I was thinking of EXPORTING the transfer functions globally, so that the pool's job would be as simple as calling the transfer functions. But is this generic?
* I'd have to EXPORT them,
* recompile the driver,
* and you can't do this for all the drivers.
COMPLETELY NOT GENERIC!
Then Hari helped!!!
He told me there are several ways to achieve this without EXPORTING. One of the ways is submit_bio(). I read in LDD3, "If you want to redirect, you change the bio->bi_bdev, and resubmit the bio"... [ GREAT!!! ]
But how to get the block_device object of a device....??
* path_lookup() the device
* Get the inode of the device from the nameidata
* Get the dev_t from the inode
* open_by_devnum() and get the block_device [GREAT!!!!]
OK... got the block_device.
I changed the block_device and submitted the bio.
Tried, tried, tried... 3 days of trying.
Hari pointed out that the bug was in bio_endio, but I didn't know what exactly the bug was.
05-01-2008, about 4.30 pm, everything got so clear...
What I thought was:
* submit_bio() returns only after performing the whole I/O operation,
* so after submit_bio() the bio is waste,
* so I killed it right after submit_bio().
But the thing is:
* submit_bio() returns after "JUST PUTTING THE BIO IN THE REQUEST QUEUE OF THE OTHER DEVICE"
* not knowing this, I was killing the bio (which was still in the request queue).
So I wrote my own bi_end_io function and did all the ending operations there.
The code worked. Whatever operations I did on the pool got reflected in ram0... Thank you, Hari, for all your help! This project is going awesome. Lots of learning...
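The lesson is easier to see in a tiny user-space model. This is not kernel code: fake_bio, fake_queue, fake_submit and fake_device_run are made-up stand-ins for the bio, the request queue, submit_bio() and the device finishing the I/O. The point is only that submitting merely queues the request, so cleanup belongs in the completion callback.

```c
#include <assert.h>
#include <stdlib.h>

struct fake_bio;
typedef void (*end_io_fn)(struct fake_bio *bio);

/* Stand-in for struct bio: a request plus its completion callback. */
struct fake_bio {
    int sector;
    end_io_fn end_io;         /* plays the role of bi_end_io */
    struct fake_bio *next;    /* link in the device's queue */
};

struct fake_queue { struct fake_bio *head, *tail; };

/* Like submit_bio() in this model: only queues the request and returns. */
static void fake_submit(struct fake_queue *q, struct fake_bio *bio)
{
    assert(bio->end_io != NULL);
    bio->next = NULL;
    if (q->tail)
        q->tail->next = bio;
    else
        q->head = bio;
    q->tail = bio;
    /* The bio is still referenced by the queue: do NOT free it here. */
}

/* The "device" finishes queued requests later and calls end_io on each. */
static int fake_device_run(struct fake_queue *q)
{
    int done = 0;

    while (q->head) {
        struct fake_bio *bio = q->head;

        q->head = bio->next;
        if (!q->head)
            q->tail = NULL;
        bio->end_io(bio);     /* the owner cleans up here */
        done++;
    }
    return done;
}

static int completed;         /* counts completions for the demo */

/* Completion callback: the only safe place to release the request. */
static void pool_end_io(struct fake_bio *bio)
{
    completed++;
    free(bio);
}
```

Freeing the request right after fake_submit() would leave the queue holding a dangling pointer, which is the same mistake I was making with the real bio.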
NEXT_STEP: one-to-many mapping...