Finally, we are ready with our very first version of PFS for Linux.

Hello everyone...
As Raj mentioned, we have completed the integration of pools and PFS. Here is a detailed documentation of our work. Those who are still not clear about our work can take a look at this document :) click here to download



Just like the first step of any open source software, our filesystem has been put up on SourceForge, and we are waiting for more involvement in building this into a next generation filesystem for Linux.


Stick a fork in me.. I'm done !

It's been a long time since I blogged. Now here it comes......................

We never expected we'd finish this fast.
A very basic file system is done. We can now dynamically add disks into the pool.
The removing part is still left. For that we need to defragment the hard disk and so on.... That job is postponed to next month...

I started with minix_fs. It was one simple filesystem. I also wrote PFS-specific bit operations [in pfs_bitops.c]. After going through the minix code some 20 times, I understood all the functions in itree_common.c. It was difficult though. Luckily, I didn't need to change these functions. I copied them as such. We made a lot of changes in dir.c... Everything went like this... Changing was no big problem. Everything got over by the 20th of March.


There were some zillion bugs in the file system... There was this bug that friends called the INFINITE FILE SYSTEM. The file system's size was unlimited... you could feed the disk even terabytes of data.. [of course, the older the data got, the more it got corrupted] Hmm... It took 2 days to fix that bug completely!

Anyway.. After all this, the PFS is now bug free! The dynamic shrinking part is left! The integration of the file system and the pool is still to be done (not a big job!).

We are planning to put the project on SourceForge soon. We need proper documentation etc. for this, so this will be done only by next month I guess...

Things that are left in our project.
* Integration of PFS and POOL.
* Benchmarking.
* Documentation.

Optional modules ( if we have time)
* Defragmenting [dynamic removal].
* Minix-to-PFS porting module.


Pools completed!!! File system under construction...

It's been a while since the blog was updated. We lost focus for a bit. Got busy with other things. Anyway... The project is back on track. We made quite a lot of progress. We got the project approved at SourceForge. Once we have a minimal usable version, we'll put it there.

As for the details of the project: all kinds of verification have been done. Pools with the same name cannot be created. The same disk cannot be added to multiple pools. Pools can be renamed and deleted. We made sure that pools always have at least one disk. It's pretty stable code now.

Since the pool is a virtual, RAM-based disk which just maps all the requests, it used to disappear every time the system was rebooted or the kernel module was reloaded. So we have put metadata into each disk which is part of a pool. I was allowed to use the first 1024 bytes (i.e. the boot block) for metadata.

As the initial plan was to store details of all disks and all pools on every participating disk, we had thought of limiting the maximum number of pools and of disks in each pool. Then we reduced the metadata to storing, on each disk, details of only the disks of the corresponding pool. Though that allowed an "infinite" number of pools, we still had to limit the number of disks in each pool.

Finally, after much thought, we just reduced the metadata to this: signature, pool name, rank, random number, total disks, checksum. The signature is just to check if the disk is part of any pool. We have used the word "pfs-inodz" as the signature. ;-) Pool name specifies which pool the disk belongs to. Rank says at what position the disk appears in the pool. The random number is to fix an issue (say two pools are created with the same name on two different computers and the disks are then connected to the same computer. PFS could go crazy... but the random number filled in saves the day. It is the same for every disk in a pool.) Total disks is to make sure that all the disks in the pool are present and thus the pool is valid. Checksum is to check if the metadata is valid.
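
A minimal sketch of how these six fields might be laid out in C. The field widths, the 32-byte name limit, the names pfs_disk_meta, pfs_meta_init and pfs_meta_valid, and the simple multiplicative checksum are all assumptions made for illustration; the post does not say which layout or checksum PFS actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PFS_SIGNATURE "pfs-inodz"  /* signature word named in the post */
#define PFS_NAME_LEN  32           /* assumed limit, not from the post */

/* Hypothetical on-disk record; 64 bytes, well inside the 1024-byte boot block. */
struct pfs_disk_meta {
    char     signature[16];          /* marks the disk as a pool member */
    char     pool_name[PFS_NAME_LEN];
    uint32_t rank;                   /* position of this disk in the pool */
    uint32_t pool_id;                /* the random number shared by all disks of a pool */
    uint32_t total_disks;            /* how many disks the pool should have */
    uint32_t checksum;               /* covers every byte before this field */
};

/* Illustrative checksum (not a real CRC): hash of all bytes before `checksum`. */
uint32_t pfs_meta_checksum(const struct pfs_disk_meta *m)
{
    const unsigned char *p = (const unsigned char *)m;
    size_t n = offsetof(struct pfs_disk_meta, checksum);
    uint32_t sum = 0;

    for (size_t i = 0; i < n; i++)
        sum = sum * 31u + p[i];
    return sum;
}

/* Fill in a record and seal it with the checksum. */
void pfs_meta_init(struct pfs_disk_meta *m, const char *pool,
                   uint32_t rank, uint32_t pool_id, uint32_t total_disks)
{
    assert(m != NULL && pool != NULL);
    memset(m, 0, sizeof *m);
    snprintf(m->signature, sizeof m->signature, "%s", PFS_SIGNATURE);
    snprintf(m->pool_name, sizeof m->pool_name, "%s", pool);
    m->rank = rank;
    m->pool_id = pool_id;
    m->total_disks = total_disks;
    m->checksum = pfs_meta_checksum(m);
}

/* A disk belongs to some pool only if both signature and checksum match. */
int pfs_meta_valid(const struct pfs_disk_meta *m)
{
    assert(m != NULL);
    return strncmp(m->signature, PFS_SIGNATURE, sizeof m->signature) == 0
        && m->checksum == pfs_meta_checksum(m);
}
```

With a record like this, two disks would count as members of the same pool only when both pool_name and pool_id agree, which is the job the post gives to the random number.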

Then another mad problem was to find a way to iterate through all block_devices connected to the system. The internal list maintained by the kernel, all_bdevs, is not exported and so not accessible. So... the ioctl guy, santosh, wrote a program to get the list of all devices from the output of the command "fdisk -l" and pass it to the module when it is inserted.

Now with that list.... I could read all the metadata from participating disks and recreate the pools every time. TADAAA!!! Done.

Meanwhile... rajgopal was looking through the minix code. By the way... that's the file system we have decided to use as a base to work on (because of its comparative simplicity, of course). Made some progress there. Apparently, it is easier than we thought it would be. Modification of some 10 functions which handle the superblock should do the job.

The file system needs a lot of data from the pools. So we decided to integrate the two and run it as one single kernel module. We are integrating our pool code with the minix2 code now.

Anyway, instead of me blabbering... he knows better about the file system now. I'll ask him to post.

At least it looks as though we'll be able to finish the project well within the deadline... Fingers crossed.


Heading towards completion of POOLS :)

With the level of confusion and cluelessness we had in the beginning of the project, we never expected the project would pick up such a pace. Guess that's how it is when it comes to Linux kernel hacking. Unbelievably interesting work to do. Anyway... coming to the progress of the project, we are still working on the pools. We are done with, let's say, 85-90% of it. There were features and bugs which took much longer than we expected. Since the last blog update, we added these features successfully:
-Adding block devices to pools
-Removing block devices from pools
-Creating pools via IOCTL commands
-Multiple number of pools
-Names to pools

Adding and removing were pretty straightforward. We were stuck when we needed to write code to return a block_device when the name of the device was given. We finally used this function called open_by_devnum(). Later, when our job with the block_devices is over (exiting the module, deleting disks from a pool, etc.), we use the function blkdev_put() to close the devices.
Then, for creating pools, we decided to have an initial pool called /dev/pool which would receive and execute all requests (IOCTL) to create pools.

TO-DO list:
-Writing metadata having information about all pools and constituent block devices to pools.
-Scanning all disks at startup to create pools.
-Minor work such as renaming and deleting pools.
-IOCTL commands to list all pools and their member block devices.

With that, we would complete the work on the pools. Hopefully in another week. We'll probably put the code for the pool on the blog after that.
Then we have to work on the filesystem. A lot of study has to be done. We have no clear direction yet. We expect to get stuck for a long time. As he said... GOD HELP US WITH THAT.

One to many mapping and IOCTL.

This project is going really great. After the one to one mapping, the next plan was to make a one to many mapping.
The pool disk will be the one on which the file system resides. This pool device internally maps to the block devices of the original hard disks.

Krishna and I worked hard like never before.
The one to many consists of these steps.

  • Get the bio
  • Find in which disk the bio starts.
  • If bio starts and ends in the same device,
  • then change the block_dev and sector and send_bio;
  • else if the bio spans many devices,
  • then split_bio and call this same function recursively
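
The steps above can be modelled in plain user-space C. This is only a sketch under assumptions: the structs pool_member and pool_piece and the function pool_split are made up for illustration, sizes are counted in sectors, and a real driver would work on struct bio and resubmit each piece instead of filling an array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the devices behind the pool. */
struct pool_member { const char *name; unsigned long sectors; };

/* One piece of a request after splitting: which device, where, how much. */
struct pool_piece { size_t dev; unsigned long sector; unsigned long count; };

/*
 * Split the request [start, start + count) on the pool into per-device
 * pieces, recursing whenever the request runs past the end of a device.
 * Returns the number of pieces written to out, or -1 if the request
 * does not fit in the pool or in out.
 */
int pool_split(const struct pool_member *devs, size_t ndevs,
               unsigned long start, unsigned long count,
               struct pool_piece *out, size_t max_out)
{
    unsigned long base = 0;  /* first pool sector of the current device */

    assert(devs != NULL && out != NULL);
    if (count == 0)
        return 0;

    for (size_t i = 0; i < ndevs; i++) {
        unsigned long end = base + devs[i].sectors;

        if (start < end) {                    /* request starts on device i */
            unsigned long room = end - start;
            unsigned long take = count < room ? count : room;
            int rest;

            if (max_out == 0)
                return -1;
            out[0].dev = i;
            out[0].sector = start - base;     /* sector local to device i */
            out[0].count = take;
            if (take == count)                /* starts and ends on one device */
                return 1;

            /* Spans several devices: handle the remainder the same way. */
            rest = pool_split(devs, ndevs, start + take, count - take,
                              out + 1, max_out - 1);
            return rest < 0 ? -1 : rest + 1;
        }
        base = end;
    }
    return -1;                                /* start lies beyond the pool */
}
```

For three devices of 100 sectors each, a request for 120 sectors starting at pool sector 90 comes out as three pieces: the last 10 sectors of the first device, all 100 of the second, and the first 10 of the third.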
This had lots of bugs. We spent some 2 days making the code bug free (at least as far as we tested, it's bug free).
While under development, this module spans only /dev/ram0 and /dev/ram1 by default. But we need the user to select the devices.

Here comes ioctl.

The way user programs will communicate with our driver is through IOCTL calls. Krishna and I now started concentrating on writing ioctls. We wrote the ioctl handling inside the driver. Santhosh started working on user commands. He creates C programs that take commands. Commands look something like,

$ pool add /dev/pool0 /dev/ram10 /dev/ram12

This means: add /dev/ram10 and /dev/ram12 to /dev/pool0. This guy gets the devices from the command line arguments and puts them in a data structure. Finally, according to what is to be done (add, remove etc.), he passes the corresponding ioctl_command_number with the data structure.

We get this data structure inside the driver and we add the corresponding block_device objects of the devices to our list of devices handled by the pool. Same applies for remove..
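
Here is a rough sketch of what the user-side part could look like. Everything in it is hypothetical: the struct pool_request, the POOL_CMD_* numbers, the limits and the function pool_parse are invented for illustration and are not the project's actual definitions. A real tool would then open the pool device and hand the filled structure to the driver with ioctl().

```c
#include <assert.h>
#include <string.h>

#define POOL_MAX_DEVS 16   /* assumed limits, not from the post */
#define POOL_PATH_LEN 64

/* Hypothetical command numbers shared by the tool and the driver. */
enum pool_cmd { POOL_CMD_ADD = 1, POOL_CMD_REMOVE = 2 };

/* Hypothetical structure passed along with the ioctl command number. */
struct pool_request {
    enum pool_cmd cmd;
    char pool[POOL_PATH_LEN];                 /* e.g. /dev/pool0 */
    int  ndevs;
    char devs[POOL_MAX_DEVS][POOL_PATH_LEN];  /* e.g. /dev/ram10 */
};

/*
 * Parse "pool add|remove <pool> <dev>..." into req.
 * Returns 0 on success, -1 on a malformed command line.
 */
int pool_parse(int argc, char *const argv[], struct pool_request *req)
{
    assert(argv != NULL && req != NULL);
    if (argc < 4 || argc - 3 > POOL_MAX_DEVS)
        return -1;

    memset(req, 0, sizeof *req);
    if (strcmp(argv[1], "add") == 0)
        req->cmd = POOL_CMD_ADD;
    else if (strcmp(argv[1], "remove") == 0)
        req->cmd = POOL_CMD_REMOVE;
    else
        return -1;

    if (strlen(argv[2]) >= POOL_PATH_LEN)
        return -1;
    strcpy(req->pool, argv[2]);

    for (int i = 3; i < argc; i++) {
        if (strlen(argv[i]) >= POOL_PATH_LEN)
            return -1;
        strcpy(req->devs[req->ndevs++], argv[i]);
    }
    return 0;
}
```

Fixed-size arrays keep the structure flat, so it can be copied into the kernel in one piece; the price is a hard limit on the number of devices per command.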

Some ioctl operations are left.. We are doing that.
Our next step will be putting metadata on the hard disks. We know nothing about this. Lots and lots of problems. As soon as we are done, I'll post the same here!
We've not yet touched the file system part. Problem lies there too....

GOD HELP US! :)


Thinking of Pool data structures..

For the past 2 days, Krishna and I were thinking about the structure of the pool. Went well. Many ideas came up. To handle the pool, there should be 2 data structures.
  • one for physical on_disk pool. (when the pool info is saved in disks, we should take care of : 'this' data is from 'this' sector to 'this' sector. )

  • one for memory (take information from physical disks and put it in the pool object in memory which is handy for programming.)
These things are going fine. Once they are clearly defined, I'll put them up here. We also came up with a few problems we might be facing.
  1. How can a disk (partition or physical disk or whatever) be uniquely identified? The device address (like /dev/sda) might change if you reboot.

  2. When the computer is rebooted, our pool should know which disks it was handling before. For this we need to store the pool-device relationship somewhere permanently (you can't store it on a disk which is participating in the pool because it can be removed). Where can I save that?

  3. (!) A lot of things become easy if we keep a maximum limit on the number of disks that can participate in the pool (say 256).. Maybe there are ways to handle an infinite number of disks, but to begin with, the first version will have 256 as the limit.


ONE - to - ONE mapping... SUCCESS

Pool is the device on which the file system is going to operate. So, the pool has to redirect the requests that it gets to the original devices below it. I was a bit stuck with "how can this communication be achieved?"................
I was thinking of EXPORTING the transfer functions globally, so that the pool's job is as simple as calling the transfer functions. But is this generic?
* I have to EXPORT them,
* Recompile the driver
* Also, you can't do this for all the drivers

COMPLETELY NOT GENERIC!

Then, Hari Helped!!!!!

He told me there are several ways by which you can achieve this without EXPORTING. One of the ways is submit_bio(). I read in LDD3, "If you want to redirect, you change the bio->bi_bdev, and resubmit the bio".... [ GREAT!!! ]
But how to get the block_device object of a device....??

Hari Helped!!!!!

* Path_lookup the device
* Get the inode of the device from nameidata
* Get the dev_t object from the inode
* open_by_devnum and get the block_device [GREAT!!!!]

ok... Got the block_device....
I changed the block_device and submitted the bio
KERNEL PANICS!!!!!!!
tried tried tried tried tried....... 3 days of trying,
Hari pointed out the bug was there in bio_endio, but I didn't know what exactly the bug was.
05-01-2008, about 4.30 pm, everything got so clear....

what i thought was,
* submit_bio() returns only after performing the whole I/O operation.
* so after the submit_bio, the bio is a waste
* i killed it after submit_bio

But the thing is,
* submit_bio returns after "JUST PUTTING THE BIO IN THE REQUEST QUEUE OF THE OTHER DEVICE"
* not knowing this, I was killing the bio (which was still in the request queue)

FINE!
i wrote my bi_end_io function and did all ending operations there.........
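
The lifetime rule described above can be imitated in user space. This is a toy model, not kernel code: fake_bio, fake_queue, fake_submit, fake_run_queue and free_on_completion are all invented names, and the queue is drained by hand where the kernel would complete the I/O later on its own. What it shows is that submitting only queues the request, so the cleanup belongs in the completion callback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for a bio: carries its own completion callback. */
struct fake_bio {
    unsigned long sector;
    void (*end_io)(struct fake_bio *bio);
    struct fake_bio *next;
};

/* Toy stand-in for the other device's request queue. */
struct fake_queue { struct fake_bio *head, *tail; };

int fake_completed = 0;   /* counts finished requests, for demonstration */

/* Like the submit described above: enqueue and return, no I/O done yet. */
void fake_submit(struct fake_queue *q, struct fake_bio *bio)
{
    assert(q != NULL && bio != NULL);
    bio->next = NULL;
    if (q->tail)
        q->tail->next = bio;
    else
        q->head = bio;
    q->tail = bio;
    /* Freeing bio here would leave a dangling pointer in the queue. */
}

/* The "device" gets around to the work later and calls end_io. */
void fake_run_queue(struct fake_queue *q)
{
    assert(q != NULL);
    while (q->head) {
        struct fake_bio *bio = q->head;

        q->head = bio->next;
        if (q->head == NULL)
            q->tail = NULL;
        bio->end_io(bio);     /* only now is the request finished */
    }
}

/* Completion callback: the only safe place to release the request. */
void free_on_completion(struct fake_bio *bio)
{
    fake_completed++;
    free(bio);
}
```

After fake_submit() returns, the request is still sitting in the queue; it is released only when fake_run_queue() invokes the callback.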
TADAAAAAA.....
The code worked. Whatever operations I did in the pool got reflected in ram0... Thank you, Hari, for all your help!
This project is going awesome..... Lots of learning........

NEXT_STEP : one-to-many mapping...


CREATING POOLS.......

This is how i'm gonna implement pools......

* Every disk that is present in the system is controlled by a driver.
* every driver has a request queue.

I will create a virtual device - POOL. Pool's size = sum of the sizes of all the devices that form the pool.
Any request that comes to my Pool's request queue will be forwarded to the corresponding physical device.
-------------------------
| DEV-A | DEV-B | DEV-C |  > dev_name
| 0-100 | 0-100 | 0-100 |  > size in sectors
-------------------------
            ||
            ||
           \  /
            \/
-------------------------
|         POOL          |
|          300          |
-------------------------


So, only this pool is visible to the user.
Any I/O request to the POOL should be converted to an I/O request to the underlying device.
Example: any request to read the 125th sector of the pool should become a request to read DEV-B's 25th sector.

please see,
http://lwn.net/Articles/58720/

In this device driver,
static void sbd_transfer(); does the copying job from buffer to disk....

I need to modify this sbd_transfer(), such that,
{
if (sector >= 0 && sector < 100)      i/o request should be sent to DEV-A's driver...

if (sector >= 100 && sector < 200)    i/o request should be sent to DEV-B's driver...

if (sector >= 200 && sector < 300)    i/o request should be sent to DEV-C's driver...
}
This is the algorithm
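
The same lookup can be written without hard-coding the three ranges. A minimal sketch in user-space C, assuming each device is described only by its size in sectors; the names pool_dev and pool_locate are made up for illustration and are not taken from the driver mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one device in the pool. */
struct pool_dev { const char *name; unsigned long sectors; };

/*
 * Find which device holds pool_sector. On success, returns the device
 * index and stores the sector number local to that device; returns -1
 * if the sector lies beyond the end of the pool.
 */
int pool_locate(const struct pool_dev *devs, size_t ndevs,
                unsigned long pool_sector, unsigned long *local_sector)
{
    unsigned long start = 0;  /* first pool sector of the current device */

    assert(devs != NULL && local_sector != NULL);
    for (size_t i = 0; i < ndevs; i++) {
        if (pool_sector < start + devs[i].sectors) {
            *local_sector = pool_sector - start;
            return (int)i;
        }
        start += devs[i].sectors;
    }
    return -1;
}
```

With DEV-A, DEV-B and DEV-C of 100 sectors each, pool sector 125 lands on the second device (index 1) at local sector 25, matching the example in the post.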

Questions.

* Is this possible ?
* If this is possible, how can we enable communication between drivers?
[pool's driver needs to pass a read/write request to DEV-B.. HOW?]


Modules of the Project...

PFS will have 2 main modules.....
1. Pool manager
2. File system.

* Pool manager is the one that should replace the volume manager.
The Pool Manager should handle multiple disks as a single logical pool. The size of that logical pool should not be limited. The size should grow when another disk is added to the pool, and it should decrease when a disk is removed! It should provide various functions to the file system that will be sitting above it.

* The File System - PFS is the one that interacts with the Pool Manager to get the data written to the hard disk. For this, the FS makes use of the methods provided by the pool manager. It also interacts with the upper layers, like system calls etc.


WHY PFS ?

Here is a detailed Description of the PFS...

Current filesystems have this problem : LIMITED SIZE...
You allocate a certain size for a file system, format it, done!

say,
in a 50GB hard disk, you have 3 logical partitions
MOVIES | PROJECTS | DOCUMENTS..
20GB........20GB...........10GB

* now consider the MOVIES partition is full
* there is 15GB free space in PROJECTS
* still you can't add even a single file to MOVIES !!!

Why not make a file system that can dynamically expand and shrink...
We'll eliminate the terms "VOLUMES, PARTITIONS"... We'll make POOLS!!!

Now, how about,
* MOVIES, PROJECTS, DOCUMENTS share the same 50GB.
* there is no separate size for each of these.
* data can be added into any of these, till the whole 50GB gets full...
* still MOVIES, PROJECTS, DOCUMENTS can be accessed as different file systems
* they can be mounted and unmounted separately..

Now,
* you buy a new hard disk (120 GB)
* you plug in the hard disk...
* boot your system...
* "a single command" (pool -add /dev/my_new_harddisk already_existing_pool)
* NO RE-FORMATTING of your hard disk
* [new space 50 + 120 GB]
* data can be added into any of MOVIES, PROJECTS, DOCUMENTS, till the whole 50GB + 120GB gets full...

It is already implemented in,
* ZFS (OpenSolaris)
* ZFS-on-FUSE (Linux)
* LVM2 can do this (but not as a readymade file system)
