Sunday, June 6, 2010

1.a) Scan Photos

In the last post, I described six major steps in the plan for building a Media Archive. This post details the first step: Scanning Photos. This will be a bit technical and most likely a dry read for most readers. The purpose of highly detailed, technical posts like this is to document the process I followed for each step and how each step fits into the project as a whole.

Scanner Settings
When I began this project, the first thing I did was experiment with a scanner I already had lying around. I chose a picture and made some scans at various values of Dots per Inch (DPI), up to the scanner's maximum of 300 DPI. While this setting might be fine for hosting the resulting images on the web or possibly re-printing them, I wanted to be sure that even cropped, zoomed in, or blown up versions of my scans would be usable without artifacts. So I bought a Canon CanoScan 8800f scanner with a maximum optical resolution of 4800 x 9600 DPI. Scanning equipment is usually distributed with software drivers that allow your image processing programs to use all the features of your particular model of scanner. The next section describes the scan settings I chose for this project.

Dots Per Inch (DPI)
The new scanner provided quite an increase in resolution. In fact, 4800 DPI scans turned out images much larger than the amount of memory in my little old laptop (which the computer didn't like so much because of thrashing). The extreme number of pixels resulting from larger prints forced me to run some experiments to determine the point at which increases in DPI resulted in no noticeable increase in the detail captured (even at extreme zooms). This was a bit of an eyeball estimate based on a few sample pictures, but for this effort I decided on 2400 DPI.
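To put rough numbers on the memory problem, here is a minimal sketch that estimates the raw, uncompressed size of a scan. The 4x6 inch print and the 48-bit color depth are assumptions chosen for illustration (both appear later in this post), and the function name is hypothetical. Real scanning software may hold several copies of the image in memory, so actual usage can be higher.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Estimate the raw size of a scan in bytes (no compression, no overhead)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


if __name__ == "__main__":
    # Assumed example: a 4x6 inch print scanned in 48-bit color.
    for dpi in (300, 1200, 2400, 4800):
        size = uncompressed_bytes(4, 6, dpi, 48)
        print(f"{dpi:>5} DPI: {size / 1e9:.2f} GB")
```

Under these assumptions a 4800 DPI scan needs a little over 3 GB before any software overhead, while 2400 DPI needs under 1 GB. Doubling the DPI quadruples the pixel count.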

Yes, I know. This setting results in images that are beyond overkill for viewing on the computer screen. They're also well beyond the typically accepted resolution for re-printing (300 DPI). However, imagine a photo that was not well composed. Let's say most of the scene was empty background and you wanted a full-size print of just the 1/8 of the image containing the subject's face. If the master scan was at 2400 DPI, that portion of the image could be blown up and printed (at 300 DPI) with very little if any pixelation visible.
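The arithmetic behind that claim can be sketched as follows. It assumes the "1/8" refers to a fraction of each linear dimension of the print (width and height) rather than of its area. The function name is hypothetical.

```python
def enlarged_print_dpi(scan_dpi, linear_crop_fraction):
    """DPI left over when a crop is enlarged back to the full print size.

    linear_crop_fraction is the share of each dimension (width and height)
    that the crop covers, e.g. 1/8 for a crop one eighth as wide and tall.
    """
    if not 0 < linear_crop_fraction <= 1:
        raise ValueError("crop fraction must be in (0, 1]")
    return scan_dpi * linear_crop_fraction


if __name__ == "__main__":
    print(enlarged_print_dpi(2400, 1 / 8))  # crop from the 2400 DPI master
    print(enlarged_print_dpi(300, 1 / 8))   # same crop from a 300 DPI scan
```

With a 2400 DPI master, a crop covering 1/8 of each dimension still delivers 300 DPI at full print size. The same crop from a 300 DPI scan would leave only 37.5 DPI.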

This theme runs through every aspect of this project: store more than is generally of practical use now to allow future uses without the need to re-scan the original prints. It is trivial for software to downscale a property of a master for a specific use but the reverse is hard (and often impossible).  For example, I even scanned "Black and White" prints as color rather than grayscale.  This provides a consistent output for the master images and allows for color correction later for images scanned from faded prints.

Color Depth
Similarly, I selected more than is needed for the Color Depth of the scans. One of the features of the CanoScan 8800f is 48-bit Color Depth scanning. Even though very little computer display hardware (video cards, cables, monitors...) supports color depths above 24-bit, there may be future applications that take advantage of colors of such precision. For almost all practical uses of the images, a reduction of resolution will be required anyway. The color depth can be dropped for those copies at the same time with little extra effort.
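As an illustration of why dropping color depth is easy while recovering it is not, here is a minimal sketch that reduces a 48-bit pixel (16 bits per channel) to 24 bits (8 bits per channel) by keeping the high byte of each channel. This is one common approach and an assumption for the sake of the example. Image editors may round or dither instead.

```python
def to_8bit(channel16):
    """Reduce one 16-bit channel value (0-65535) to 8 bits (0-255)."""
    if not 0 <= channel16 <= 0xFFFF:
        raise ValueError("expected a 16-bit channel value")
    return channel16 >> 8  # keep the high byte, discard the low byte


def pixel_48_to_24(pixel):
    """Convert an (r, g, b) tuple of 16-bit values to 8-bit values."""
    r, g, b = pixel
    return (to_8bit(r), to_8bit(g), to_8bit(b))


if __name__ == "__main__":
    print(pixel_48_to_24((65535, 32768, 0)))
    # Two distinct 16-bit shades collapse into one 8-bit shade:
    print(to_8bit(32768), to_8bit(32769))
```

Because 256 different 16-bit values map onto each 8-bit value, the discarded precision cannot be reconstructed from the 24-bit copy, which is the reason to capture it in the master.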

Other Settings
Scanner software and drivers often provide automatic features such as dust removal and color correction.  I turned all of these extra filters off except "Auto Tone" (color correction).  The others attempt to perform automatic "touch-ups" that should not really be applied to the master scans.  Other 3rd party photo editing applications offer similar tools with finer control over the effects.

Scan Profiles
The scanner drivers provided a way to save settings to named profiles.  This was helpful because it meant I could define the scan settings I wanted once and call up that profile for each image to be scanned.  This provided consistency so that I would never accidentally forget to change one of the settings from the default.  This is the profile I used for most prints:

Profile: Archive Photo
DPI: 2400
Color Mode: "Color(48-bit)"
Auto Tone: ON
[All Other Features]: OFF

As I moved through images, I created a set of profiles to use under various conditions. I found that some prints had potentially interesting captions or dates hand-written onto the back. Recording these didn't require the full quality that I had defined for the images. So I created a second profile:

Profile: Back Photo
DPI: 300
Color Mode: "Color" (meaning 24-bit color depth)
Auto Tone: ON
[All Other Features]: OFF

The DPI and Color Depth settings I chose resulted in massive numbers of pixels for even moderately sized prints. For example, a 4x6 print scanned at 2400 DPI results in 9,600 x 14,400 = 138,240,000 pixels. The CanoScan 8800f has a limit of 10,000 x 30,000 pixels when scanning in 48-bit mode (something I did not see on the declared specifications!). When scanning a 4x6 print, this limit is easily overcome by simply laying the print on the scanner bed in the correct orientation. But my initial batch of photos had many prints of non-standard sizes, some of which were over the limit in both dimensions. For these cases, I created a third profile:
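A minimal sketch of that limit check, assuming the 10,000 x 30,000 figure is a per-axis pixel limit and that a print may be laid on the bed in either orientation. The function and constant names are hypothetical, and the physical size of the scanner bed is ignored.

```python
# Observed 48-bit limit from this post; which bed axis is which is an assumption.
LIMIT_SHORT_AXIS = 10_000
LIMIT_LONG_AXIS = 30_000


def scan_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a print scanned at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def fits_48bit_limit(width_in, height_in, dpi):
    """True if the print fits the limit in at least one orientation."""
    short_side, long_side = sorted(scan_pixels(width_in, height_in, dpi))
    return short_side <= LIMIT_SHORT_AXIS and long_side <= LIMIT_LONG_AXIS


if __name__ == "__main__":
    for size in ((4, 6), (5, 7), (8, 10)):
        w, h = scan_pixels(*size, 2400)
        print(size, f"{w} x {h} = {w * h:,} pixels",
              "fits" if fits_48bit_limit(*size, 2400) else "over the limit")
```

Sorting the two dimensions stands in for rotating the print: the shorter side is matched against the tighter limit.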

Profile: Big Archive Photo
DPI: 2400
Color Mode: "Color" (meaning 24-bit color depth)
Auto Tone: ON
[All Other Features]: OFF

In cases where I was forced to use the "Big Archive Photo" profile, I also attempted to perform a second scan using the "Archive Photo" profile with the image cropped to fit within the scanner limits. For prints where too much had to be cropped in order to fit within the limits (e.g. an 8x10 print), I resigned myself to using only the "Big Archive Photo" profile.

The photo set I was working through also contained some copies of patents owned by my grandfather. Similar to the backs of the prints, these texts did not require the full resolution provided by the "Archive Photo" profile. For these, I created a fourth profile:

Profile: Archive Text
DPI: 600
Color Mode: "Color" (meaning 24-bit color depth)
Auto Tone: ON
[All Other Features]: OFF
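The four profiles above can also be written down as data, which makes the choice between them explicit. This is only a sketch: the class, the dictionary, and the `choose_profile` rule are hypothetical and reflect one reading of the text, not a feature of the scanner driver.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanProfile:
    name: str
    dpi: int
    color_bits: int          # 48 = "Color(48-bit)", 24 = "Color"
    auto_tone: bool = True   # every other driver feature is assumed OFF


PROFILES = {
    p.name: p
    for p in (
        ScanProfile("Archive Photo", 2400, 48),
        ScanProfile("Back Photo", 300, 24),
        ScanProfile("Big Archive Photo", 2400, 24),
        ScanProfile("Archive Text", 600, 24),
    )
}


def choose_profile(kind, fits_48bit_limit=True):
    """Pick a profile for 'photo', 'back', or 'text' material."""
    if kind == "photo":
        name = "Archive Photo" if fits_48bit_limit else "Big Archive Photo"
    elif kind == "back":
        name = "Back Photo"
    elif kind == "text":
        name = "Archive Text"
    else:
        raise ValueError(f"unknown kind: {kind!r}")
    return PROFILES[name]


if __name__ == "__main__":
    print(choose_profile("photo"))
    print(choose_profile("photo", fits_48bit_limit=False))
```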

Scanning Software Tools
In the last post, I outlined a process in which raw master scans are kept unmodified.  Since this post is about the scanning step, I'll focus here on the software tools used to perform the scans.  I'll devote a future post to creating touched-up versions of the master scans and the image processing tools I used.

Even though photo manipulation is performed later, the scanner drivers must be called from within some kind of image processing software (Windows now has scanning capability built in but this is still a kind of software).  Scanners are usually shipped with a bundle of image software tools to use with your scanner.  These are intended to provide the buyer everything they need to begin using their scanner right away.  These third party applications can be sufficient for casual scanning and photo manipulation tasks.  But, since different manufacturers include products developed by different vendors, it is difficult to speak to the quality of these bundled applications in general.  Personally, I usually stay away from them.   

Usually when I want to view images or perform simple edits, I reach for my favorite image processing tool: Irfanview. However, I found that when used for scanning, the resulting images (even before saving to a file) were being automatically reduced to 24-bit color. My favorite photo organizer software, Google Picasa, wouldn't allow me to scan at 48-bit color and stored the images in the information-lossy JPEG file format.

I finally settled on using an old version of Adobe Photoshop (I used an old version because it is what I had, and while Photoshop is incredibly powerful, it is also priced to match). The scans were not stored in an intermediate information-lossy format or reduced in color depth. And, since I am not performing any of the actual image manipulation at this stage, an older version of the tool works just fine for scanning and saving.

I only performed one type of manipulation on the raw scans before saving them.  Many of the photos had a border area that was not printed on.  I cropped these borders out as close as possible.  Most of the prints turned out to not be perfectly square at high magnification (I used 300% zoom to set the crop lines), so small amounts of border are sometimes visible.

I even took extra effort to place images at right angles on the scanner bed. This sometimes required two or three test scans at lower resolution to set the prints just right. This is because the scanner "eye" does not "see" all the way to the edge of the scanner bed. Placing a print with no border along any edge of the scanner bed would miss scanning a sliver of the print along that edge. By taking the extra effort required to place prints squarely (by trial and error), I was able to avoid digitally rotating the images before saving them, further ensuring that the information stored in the master images is the exact unmodified set of pixels captured by the scanner.

Saving the Master Images
After the print is scanned, the master image must be saved to a file.  Future posts discuss the storage (1.b) and indexing (1.c) of the master image files.  For now, assume that there is a place to save the images and that each print is assigned an index name.  Saving the scanned image requires two more decisions: What to name the file and in which image file type to save the image data.

File Names
The basic approach is to save each master image with a file name that is based on the index assigned to the print.  This way, the print or the master image can be found by using the index marked on the other.  However, some prints produced multiple master images (See 'Scan Profiles' above).  For these, file name modifiers are added after the index to indicate the part of the print scanned.

Part of Print Scanned                                   | Scan Profile      | Modified Image Name
Main print image with the border cropped off            | Archive Photo     | [Index]
Dates / Captions from the back                          | Archive Back      | [Index]_back
Dates / Captions from the border                        | Archive Back      | [Index]_border
Cropped scan of images too large for Archive Photo      | Archive Photo     | [Index]_crop48
Full scan of images too large for Archive Photo profile | Big Archive Photo | [Index]_full24
Page of text                                            | Archive Text      | [Index]_text_p#
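The naming scheme in the table can be sketched as a small helper. The index value "A0042", the function name, and the ".psd" extension default are assumptions made for illustration. Only the modifiers themselves come from the table.

```python
# Modifiers taken from the table above; "main" means no modifier.
MODIFIERS = {
    "main": "",
    "back": "_back",
    "border": "_border",
    "crop48": "_crop48",
    "full24": "_full24",
}


def master_file_name(index, part="main", page=None, ext=".psd"):
    """Build a master image file name from a print's index."""
    if part == "text":
        if page is None:
            raise ValueError("text scans need a page number")
        return f"{index}_text_p{page}{ext}"
    if part not in MODIFIERS:
        raise ValueError(f"unknown part: {part!r}")
    return f"{index}{MODIFIERS[part]}{ext}"


if __name__ == "__main__":
    print(master_file_name("A0042"))
    print(master_file_name("A0042", "back"))
    print(master_file_name("A0042", "text", page=3))
```

Because every name starts with the index, all master images made from one print sort together in a directory listing.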

Image File Type
This section caused some trouble for me. There are many image file types with many differences between them. For saving the master images, the file type must be a well accepted standard and supported by many different software tools. This increases the chances that the files saved now will still be readable by software whenever they may be needed in the future. The image file type must not use any kind of lossy compression (like JPEG).

I narrowed the selection down to three different formats:
JPEG2000 is a new version of the JPEG standard which includes a lossless compression option.  In my testing, this format compressed the images the most.  However, it never seemed to catch on as a common file type.  Not many software editors support it.  The ones that do, usually need some kind of plug-in in order to obtain that support.

Portable Network Graphics (PNG) is a format that is intended to replace Graphics Interchange Format (GIF).  In my testing, this format did not compress quite as well as JPEG2000.  However, it is supported by almost all image viewers, editors, and web browsers.

Photoshop Document (PSD) is the native file format of Adobe Photoshop.  It supports the editing features provided by Photoshop (including layers).  However, the resulting files are very large. 

I eventually decided to store the master images in the PSD format.  It allows for more options for saving during later editing and touch-up steps.  Plus, storage is so inexpensive, the space saved by the compression of the other formats is not worth the extra effort and limitations.

This was a long and rambling post.  The next post will be a step-by-step procedure to follow while scanning.

Saturday, May 2, 2009

1) Media Archive Plan

In the last post, I described the three major components of the project. This post describes my general plan for constructing the first component: The Media Archive.

1.a) Scan Photos

The first step in the whole project is to scan family photos into a digital format. My wife and I have a few prints from when we were growing up but they are relatively newer prints than most of those held by other members of the family. I have decided to begin with photo sets passed down from my grandparents. Thanks to my aunt Linda, who lent me her collection, I have been able to begin experimenting with scanners and software.

I'll save the technical details of the scanning tools, settings, and procedures for a future post (you can't wait can you?). The general idea is to capture the images at a very high resolution and color depth with all of the "extras" provided by the scanner software (e.g. sharpening, dust & scratch removal) turned off.

The resulting digital image is saved in a file format that uses either no compression or lossless compression. Many popular graphical file formats (including JPEG) use lossy compression which sacrifices some of the information contained in the file in order to greatly reduce the file size. While most of the time the loss of information goes unnoticed (even with more than 90% of the information removed), deleting information runs counter to the entire purpose of the archive: to preserve as much information as possible.
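The difference between lossless and lossy storage can be shown with a small round trip. This sketch uses Python's standard zlib module as a stand-in for a lossless image codec, and a crude quantization step as a stand-in for lossy compression. Neither is what an actual image format does internally, and the sample bytes are made up.

```python
import zlib


def lossless_roundtrip(data):
    """Compress and decompress with zlib; the result equals the input."""
    return zlib.decompress(zlib.compress(data, 9))


def lossy_roundtrip(data, step=16):
    """Crude lossy stand-in: round every byte down to a multiple of step."""
    return bytes((value // step) * step for value in data)


if __name__ == "__main__":
    sample = bytes(range(256)) * 64  # made-up stand-in for pixel data
    print("lossless identical:", lossless_roundtrip(sample) == sample)
    print("lossy identical:   ", lossy_roundtrip(sample) == sample)
```

The lossless round trip returns exactly the bytes that went in. The lossy one does not, and no later processing can recover what was thrown away.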

All of this is done so that I obtain digital versions that are as close to the original print as possible.

1.b) Store Master Images

Because of all of the choices made in the scanning step, the resulting files are so large as to be completely impractical for direct uses (especially for web-based purposes). Instead, the raw scans are kept unmodified as master versions. Only copies of the master files are "touched-up" and scaled down to practical sizes. This way, all the information that was captured during the scan is always maintained at the master.

Besides, storage is cheap! Today, you can get hard drives at about $0.08 per gigabyte and it will only get cheaper. Even if each image took up a whole gigabyte, you could store 2,500 images for only $200.
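The cost estimate is simple multiplication, sketched here so other prices or image sizes can be plugged in. The function name is hypothetical, and the figures are the ones quoted above.

```python
def storage_cost(image_count, gb_per_image, dollars_per_gb):
    """Raw disk cost in dollars, ignoring backups and redundancy."""
    return image_count * gb_per_image * dollars_per_gb


if __name__ == "__main__":
    cost = storage_cost(2500, 1, 0.08)
    print(f"${cost:,.2f}")
```

Keeping a redundant copy or a backup multiplies this figure by the number of copies kept.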

Of course, it's not that simple. Hard drives have moving parts which eventually wear out. A hard drive lures you into a sense of security until you have all of your important information stored on it with no backups and then fails catastrophically. Automatic data redundancies and regular backups are critical components of the Media Archive. A future post will explain how I plan to store

1.c) Index Originals

Occasionally, it may be necessary to retrieve the master file for an image. Perhaps to create a new print or blown up version. Similarly, the original print might be required. An image indexing system is required for either of these situations.

Each image created from the archive should be linked back to the master image and the original print. The name of the file could be used to associate an image with its master but file names are easily and often changed. Some image file formats contain fields for storing metadata which could be used to store the indexing information. However, with this method, some table would be required to look up the image file name of the master from the information contained in the metadata of an image.
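A minimal sketch of such a lookup table, assuming each derived image carries its print index in a metadata field. The field name "archive_index", the index values, and the file names are all hypothetical.

```python
# Hypothetical table: print index -> master image file name.
MASTER_TABLE = {
    "A0042": "A0042.psd",
    "A0043": "A0043_full24.psd",
}


def find_master(image_metadata, table=MASTER_TABLE):
    """Return the master file name for a derived image's metadata."""
    index = image_metadata.get("archive_index")
    if index is None:
        raise KeyError("image has no archive_index metadata")
    return table[index]


if __name__ == "__main__":
    # The derived file can be renamed freely; the index travels in metadata.
    renamed_copy = {"file_name": "grandpa_fishing.jpg", "archive_index": "A0042"}
    print(find_master(renamed_copy))
```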

Additionally, the original prints must be stored in a way that allows them to be retrieved through the index. An index number could be written in archival ink onto the back of each image as it is scanned. Better yet, index cards or table of contents pages could be used to identify the index for small groups of prints.

1.d) Touch up & Downscale

With the master copies and original prints indexed and safely stored, the next step is to create usable versions. Any filters to improve the image are applied at this step (e.g. dust & scratch removal, sharpen, further cropping ...). If extensive work is required, the product is first saved in the same format as the master and archived with it. This way the touch ups do not need to be reapplied for any future versions made from that master.

There are many ways that the images can be scaled down to practical sizes. For example, black and white images can be converted to grayscale. The resolution can be reduced and the image stored in a file format that includes compression (even lossy compression if desired).
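Two of these reductions can be sketched in plain Python on small lists of pixel values. This is only an illustration of the idea: real tools use better resampling filters, the grayscale weights are the common Rec. 601 luma coefficients (an assumption about which conversion to use), and the function names are hypothetical.

```python
def to_gray(rgb):
    """Convert an 8-bit (r, g, b) pixel to one gray value (Rec. 601 weights)."""
    r, g, b = rgb
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def halve_resolution(gray_rows):
    """Halve width and height by averaging each 2x2 block of gray values.

    A trailing odd row or column is dropped.
    """
    out = []
    for y in range(0, len(gray_rows) - 1, 2):
        row = []
        for x in range(0, len(gray_rows[y]) - 1, 2):
            total = (gray_rows[y][x] + gray_rows[y][x + 1]
                     + gray_rows[y + 1][x] + gray_rows[y + 1][x + 1])
            row.append(total // 4)
        out.append(row)
    return out


if __name__ == "__main__":
    print(to_gray((255, 0, 0)))
    print(halve_resolution([[10, 20, 30, 40], [10, 20, 30, 40]]))
```

Both steps discard information, which is why they are applied only to copies and never to the master.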

1.e) Publish

The downscaled images should then be employed in an accessible way. This might include digital photo frames, photo organization software, or online photo streams for sharing with the world.

1.f) Tag

The last step for a photo in the Media Archive is to be tagged with descriptive information. Generally, photo organization software and websites provide a slot for a text description of what is depicted.

However, most modern options provide a tag cloud feature. This allows photos to be assigned a set of descriptive phrases. This feature is used to identify individuals, places, or things in pictures and automatically associate that image with others tagged similarly.
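The association between similarly tagged photos can be sketched as an inverted index from tags to photos. The photo indexes and tags below are made up, and the function names are hypothetical. Photo sites implement this in their own ways.

```python
from collections import defaultdict


def build_tag_index(photo_tags):
    """Map each tag to the set of photos that carry it."""
    index = defaultdict(set)
    for photo, tags in photo_tags.items():
        for tag in tags:
            index[tag].add(photo)
    return index


def related_photos(photo, photo_tags):
    """Photos sharing at least one tag with the given photo."""
    index = build_tag_index(photo_tags)
    related = set()
    for tag in photo_tags.get(photo, ()):
        related |= index[tag]
    related.discard(photo)
    return related


if __name__ == "__main__":
    tags = {
        "A0001": {"Grandpa", "Farm"},
        "A0002": {"Grandpa"},
        "A0003": {"Lake"},
    }
    print(related_photos("A0001", tags))
```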

I will not recognize the people and places depicted in all of the pictures. But, by publishing them on the web, I can leverage the collective knowledge of the whole family in tagging media in the archive.

Ok, I know that was a lot of information all at once but it provides a general overview of the design of the Media Archive.

Thursday, April 30, 2009

Project Overview

The general goal of this project is to collect and safely preserve the history of my family. In the beginning, this project will focus on previous generations; however, it will be designed to be a living archive, growing with the family as time goes on.

What do I mean by family history? The specifics of this project continue to change as it is executed but there are a few obvious starting points for this project:

1) A Media Archive
Sure, digital cameras have made it easier to take and enjoy family photos. They'll never degrade and if you're good about backing up your data, you can even be reasonably sure that they are safely stored for the long term.

As amazing as this sounds to all you kids Tweeting and swapping pics with cell phones, photos used to only come printed on paper. Sure, you got the negatives too, but those never seem to make it through spring cleaning. This leaves collections of prints stashed away in albums, boxes, envelopes, and bags. When not cared for, they often fade, scratch, crack, stick together, rip and generally self destruct over time. While there are steps that can be taken to safely store prints, it's usually a losing battle against time.

Then there are the unexpected forces of nature such as fires and floods. You almost want to keep your originals in a climate controlled vault to protect them. Of course they're of little use locked away from view.

The solution is to scan the originals into a digital format. Not only does it drastically increase the chances that the photos will survive, it also allows people to share them more easily. Since the digital versions of the images are so easily accessible and shareable, the originals can then be stored safely away.

At the beginning of this project I will focus on images but I have already done preliminary work capturing old home movies to a digital format as well.

2) A Family Tree
A media archive is one part of the preservation of family history. But, without knowing who the people are and how they are related, the images convey little meaning. A family tree represents this type of information nicely. And the act of building a family tree can be a family activity that teaches younger generations about the family's past.

Manuel Turlin, "Turlin family tree", June 21 2009 via Wikimedia Commons, Creative Commons License

But, a family tree by itself doesn't provide much of a connection to who the people are (or were). My goal is to integrate the Media Archive with the Family Tree. Imagine the following scenario:

You are browsing the Media Archive and run across an interesting picture of your grandfather as a young man. He's standing with someone you don't recognize so you check the image's metadata to find out who he is. He turns out to be a member of the family so his name links into the Family Tree where you see his relationship to your grandfather (and to you). You read about where he lived and maybe what he did for a living. From there, you click through to his gallery on the Media Archive and begin browsing other images depicting him.
Rinse, repeat.

You've moved back and forth between the two systems seamlessly. The person shown standing with your grandfather now feels like a real person to you.

3) The Stories
In the scenario described in the last section, a fairly significant chunk of information was conveyed. But even with images and biographic information, you still don't get the whole picture. What was really going on in that picture?

You MIGHT know where it was taken if the image was tagged with that information. But why was your grandfather there? Why was he standing with that other member of the family? Was it a family reunion or did they spend every summer together?

These types of information don't typically fit in categories of metadata. They make up the story that led up to the picture being taken. You might expect this kind of information to be in the caption of each picture. But, usually a caption is fairly short and after writing enough of them, they tend to take the form: "Jon and Jenn with their Grandparents." Not very helpful...

This is the part of the project I have put the least research into but expect to be the hardest to implement. Even if I am able to find the right technology to support it (comment with suggestions), gathering the stories from those who remember them is going to be tough. They are likely to be sparse and require a lot of time to capture and I know we all have very busy lives.

The plan is to put together a system that allows information to be entered by many people over time. I can't possibly sit down with everybody to get pictures, biographic info, and stories. But, if the system allows everybody to contribute what they know, the whole idea becomes much more feasible.

Sunday, April 26, 2009

What is this blog for?

In April of 2009, I finally began an effort of digitizing old family photos, documents, and memories. This was a project that I had been intending to tackle for quite a while. In fact, I had previously made a handful of attempts. Each time, the project got reshuffled lower on the priority list when things got busy (as they always seem to do) and I would eventually stop working on it altogether. A future post will explain some of the motivations and mechanisms I am employing to stay more focused this time around.

I plan to use this blog to document the project as I work through it. There are many reasons for creating this type of documentation. As a software engineer, I recognize the general need to document your work. The rest of this post will explain how I plan to use this blog.

One of the major consumers of this blog will be myself. I have a terrible terrible memory and will record details of my decision making process here. Basically, I want to make sure I know what the heck I was thinking when I look back at certain decisions along the way. Hopefully, this will help me avoid doing something that makes a planned step or intended use of the archive impossible.

I also plan to use this blog as a collection of technical instructions and procedures that I follow. This way, when I return to the project after a break (hey, life happens), I will remember where I left off and what to do next. This also enables me to accept help offered by others (post a comment to get involved) while ensuring consistency across the whole project.

Lastly, this blog will be used to discuss the project with others. It will keep those interested up to date on my progress and help coordinate the efforts of those participating directly in the project. It will also help me connect to those who have experience working on similar projects and allow me to benefit from their expertise.

These are some of the major reasons this blog exists. The next post will describe in more detail what type of archive I am attempting to create and the components I have selected so far.