Category: search

Creative Funding for Text-Mining and Visualization Project

The Hip-Hop Word Count project on Kickstarter.com caught my eye because it seems to be a really interesting new model for funding a digital humanities project. You can watch the video below – but the core of the project tackles assorted metadata from 40,000 rap songs from 1979 to the present, including stats about each song (word count, syllables, education level, etc.), individual words, artist location and date. This information aims to become a public online almanac fueled by visualizations.

I am a backer of this project, and you can be too. As of the original writing of this post, they are 47% funded with twenty-eight days left before their deadline. For those of you not familiar with Kickstarter, people can post creative projects and provide rewards for their funders. The funding only goes through if they reach their goal within the time limit – otherwise nothing happens, a model they call ‘all-or-nothing funding’.

What will the money be spent on?

  • 45% for PHP programmers who have been coding the custom web interface
  • 35% for interface designers
  • 10% for data acquisition & data clean up
  • 10% for hosting bills

They aim for a five-month timeline to move from their existing functional prototype to something viable to release to the public.

I am also intrigued by ways that the work on this project might be leveraged in the future to support similar text-mining projects that tie in location and date. How about doing the same thing with Civil War letters? How about mining the lyrics from Broadway musical songs?

If this all sounds interesting, take a look at the video below and read more on the Hip-Hop Word Count Kickstarter home page. If half the people who follow my RSS feed pitch in $10, this project would be funded. Take a look and consider pitching in. If this project doesn’t speak to you – take a look around Kickstarter for something else you might want to support.

SEO Evaluation of an Archival Website: Looking at UMBC’s Digital Collections

[Image: Flickr Commons: Do-it-yourself-woman]

Each week brings announcements of archives launching new websites. Today both my email and Twitter told me about the University of Maryland, Baltimore County’s new Digital Collections site. Who can resist peeking at new materials available online?

I have spent much of the past year learning the details of Search Engine Optimization. Usually shortened to SEO, this simply refers to the use of techniques which improve the traffic sent to a website via organic search. Want your webpage to show up at the top of the list for a specific search in Google? You want to work on your SEO.

So when I look at a new archives website, I can’t help but keep an eye open for how well the site is optimized for search engines.

I hope that UMBC will forgive me for nitpicking their new site. A lot of their choices are great for SEO, but they also have room for improvement.

Things Done Well for SEO

  • Home Page Title & Description: The site’s home page has a good meta description. This is the text displayed below the link on a search results page – as shown below: [Image: UMBC Digital Collection Google Result]
  • Unique Page Titles At Collection Level: Each photography collection homepage has a unique page title and a nice block of explanatory text. Google can only read words – so the more unique text on a page, the better the job Google can do in figuring out what your page is about. Example: Ardsley Park Album
  • Good anchor text: (also known as link text) The words used in anchor text tell search engines information about the destination page. For example, the blue text below is anchor text. [Image: UMBC Anchor Text Example]

Areas for SEO Improvement

  • Unique Page Titles At Item Level: Individual images and documents all use a generic page title such as ‘UMBC | Digital Archive | Document Viewer’. Document example: Accidental Death of an Anarchist. Image example: 10 year old Bootblack.
  • H1 Tags: In the HTML of each page, the dominant heading of the page should use the <h1> tag. This helps Google know the phrase you are targeting with this page. It is your 2nd best place to emphasize your content after the page title. In the case of the item pages, there often seems to be a headline-type title at the top of the page – but it currently is not demarcated with an <h1> tag.
  • Think About Search Results and Indexing: Pages displaying results of internal searches on your site are not likely to be useful as indexed pages in Google. The thinking here is that they can dilute the focus on the item and collection level pages on your site if Google also has many search results pages in the index. If UMBC wanted their search pages to be indexed, then those pages’ URLs should be simplified and the search results pages would need a page title that somehow includes the search criteria. There are two ways that I know of to keep these pages out of the index – blocking via the site’s robots.txt file or via a robots meta tag in the header of the search results page. The robots.txt file tells obliging search engines not to crawl certain parts of your site, while the robots meta tag tells them not to index the page it appears on.
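
To make the robots.txt option a little more concrete, here is a minimal Python sketch that uses the standard library's urllib.robotparser to check which URLs a set of rules would block. The rules, the /search path and the example.edu addresses are all made up for illustration; I have not looked at how UMBC's URLs or robots.txt are actually set up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules: ask all crawlers to stay out of
# internal search result pages (the /search path is an assumption).
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def is_crawlable(url, user_agent="*"):
    """Return True if the rules above allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Both URLs are invented examples, not real UMBC pages.
    for url in (
        "https://example.edu/collections/ardsley-park",
        "https://example.edu/search?q=theater",
    ):
        status = "crawlable" if is_crawlable(url) else "blocked"
        print(url, "->", status)
```

The meta tag alternative would instead be a line such as <meta name="robots" content="noindex"> in the head of each search results page.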

Final Thoughts

There are plenty of other things that UMBC could do to support this new website. They could create an XML sitemap of all their pages and submit it to Google (maybe they already have). They might re-title some of their pages based on using a tool like Google Insights for Search to see which variations of a phrase are searched on most frequently. My goal here was to give you a taste of the sorts of things that catch my eye. Also, SEO is still more of an art than a science – so you will sometimes notice that what one SEO expert recommends is the opposite of what the next expert would tell you.
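
For anyone curious what an XML sitemap looks like, here is a rough Python sketch that builds a bare-bones one with the standard library. The page URLs are invented placeholders, and a real sitemap can carry optional fields (such as a last-modified date) that I have left out.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical pages, for illustration only.
    pages = [
        "https://example.edu/collections/",
        "https://example.edu/collections/ardsley-park",
    ]
    print(build_sitemap(pages))
```

The resulting file would then be saved on the site and submitted to Google through its webmaster tools.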

In many cases, changes such as the Unique Page Title at the Item Level mentioned above may not even be possible due to software or programmer resource limitations. The trick is to take advantage of every option that is available. There are also trade-offs to be made. UMBC’s site provides some very slick interfaces for viewing the details of a group of documents, such as theater programs and other materials related to a theatrical production. The implementation elegantly handles the situation of multiple scanned images which relate to a coherent set of documents. Sometimes you can’t have both your innovative UI and perfect SEO. Then it gets down to what your goals are for your website. Are you trying to make a specific community of existing users happy by providing them with tools they can use? Or does your mission focus more on reaching out to a broader audience?

There is no silver bullet to search engine optimization. It just takes knowledge of the available tools and techniques combined with a willingness to keep learning and experimenting. Like the ‘Do-It-Yourself-Woman’ pictured above in the Nationaal Archief’s photo I found out on the Flickr Commons, you too can learn the basics and do-it-yourself. A great starting point is Google’s free SEO Guide. Also, please remember that the best time to plan your SEO strategy is before you have built your site in the first place!

I would love to do research on how much progress archives websites can make in their organic search traffic after SEO improvements. My thinking is to take a snapshot of a month of analytics (the statistics that tell you how many people are visiting your website) and then apply some SEO-inspired changes. After a suitable delay (it takes some time for SEO to do its job) we would take another month of analytics and compare the two to determine any change in organic traffic.
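
The comparison itself is simple arithmetic. As a small sketch in Python, assuming all we have is the number of organic search visits for each of the two months (the visit counts below are made up for illustration, not real data from any site):

```python
def percent_change(before, after):
    """Percent change in visits from the baseline month to the follow-up month."""
    if before == 0:
        raise ValueError("baseline month has no visits; percent change is undefined")
    return (after - before) / before * 100.0


if __name__ == "__main__":
    # Made-up organic visit counts, for illustration only.
    baseline_organic = 1200
    followup_organic = 1500
    change = percent_change(baseline_organic, followup_organic)
    print(f"Organic traffic change: {change:+.1f}%")
```

A real study would also need to account for seasonal swings in traffic, so a single before-and-after pair is only a starting point.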

Do you want me to do a quick review of your archives website to see if there is room for SEO improvement? Please contact me or add a comment to this post. I feel like there is a conference presentation in all this if we can find a good set of websites to optimize.

Finally, thank you to unsuspecting UMBC – your new website really is beautiful.

Image credit: Doe-het-zelf vrouw /Do-it-yourself-woman from Nationaal Archief on Flickr Commons.

Yahoo & Google’s Search for Reusable Images and the Flickr Commons

When I read about Yahoo Image Search’s recent addition of a filter to return only Creative Commons Flickr images, I got all excited about what this might mean for images in the Flickr Commons. So I raced off to the Yahoo Image Search page to see how it works. The short answer is that the new special rights setting of ‘no known copyright restrictions’ that they created for members of the Flickr Commons apparently doesn’t count.

For my test I searched for an exact match on “Ticket with portrait of George Washington”. This returns one result – the one image in Flickr with the same name, from The Field Museum in Flickr Commons. If you click on the ‘More Filters’ link, you will see other ways to filter your results – including the option to restrict your results to only include images whose creators permit reuse.

[Image: Creator permits reuse - Yahoo image search]

Next I clicked on ‘Creator allows reuse’ and my one result disappeared! Quite disappointing in my book.

Google is also getting onto the ‘make it easy to search for reusable images’ bandwagon. Search Engine Land reported that Google Images Quietly Adds Creative Commons Filter. That post pointed me to Google Operating System’s search interface that lets you play with the options that Google has available. After clicking through to some of the images returned by a Google Image Search for Creative Commons images of archives, the way the Google model appears to work is to look for Creative Commons badges or links on the page with the image. I even found Flickr Creative Commons images, but when I tried to find my Flickr Commons image of the ticket used above for my Yahoo image search experiment, it wasn’t returned by Google either.

So if an archives (or museum or library) posts images on a page that indicates that the content is licensed under creative commons, it seems those images will then appear in Google’s image search as reusable. That is good news! Another way to get users to find your public domain images.

The question I am left with is how to resolve the gap between Flickr Commons’ ‘no known copyright restrictions’ rights statement and both Google and Yahoo’s definition of reusable content.

Sunshine Week 2009: Archives, Records and Other Online Government Information

Sunshine Week 2009 is a national initiative spearheaded by journalists to “open a dialogue about the importance of open government and freedom of information”. The Electronic Frontier Foundation (EFF) chose to mark Sunshine Week this year by announcing the release of their new tool for searching EFF’s FOIA documents. Learn more about EFF’s efforts to make open government a reality in this EFF call to action.

The Sunshine Week blog announced the release of a 2009 Survey Of State Government Information Online. The survey results explain:

Using a standardized worksheet surveyors rated each section on its usability, looking at factors such as whether the information was clearly linked, if full reports or only summaries were available, whether viewing and/or downloading was free, and whether the data were current. The categories for the survey were selected for generally serving the overall public good — the kind of information people need for their own health and well-being and that of the community.

See the worksheet for details on the categories selected for inclusion in the survey and the results for lots of interesting tidbits about exactly which states provide access (or not) to various public information online. A few very randomly selected highlights:

  • Maryland: Nursing home information, mhcc.maryland.gov/consumerinfo/nhguide, got high marks for facilitating online search and for allowing users to “compare data in a variety of ways.”
  • Iowa: The state auditor’s office reportedly offers online more than 5,000 full reports of all its audits dating back to 2001. The audits are easily accessible from tabs on the main Web page, www.auditor.iowa.gov.
  • Colorado: Bridge inspection reports in Colorado are considered public, but they are not published online. Anyone who wants to see the reports is advised to file an FOI request.

All of this made me recall my blog post about the parallel goals of journalists and archivists when considering digital public records and databases. I wanted to celebrate Sunshine Week by looking for other online sources of government information. My first stop was the website of the Council of State Archivists (CoSA). They had a couple of great resources including:

A bit further afield we find GovernmentDocs.org advertised as a “community government document reviewer system”. On their about page we read:

With the GovernmentDocs.org system, citizen reviewers can engage in the government accountability process like never before. Registered users can review and comment on documents, adding their insights and expertise to the work of the national nonprofit organizations which are partnering on this project. This new information then becomes instantly searchable. The text of each document is searchable, as well, thanks to a powerful Optical Character Recognition (OCR) functionality.

GovernmentDocs.org adds a powerful layer to government transparency and accountability by indexing documents in a user-friendly manner that is remarkably easy to share. Every page of every document has its own unique url, allowing you and other users to link to that page on blogs, send emails about the documents to friends, and expose the information to a wider audience.

Here is an example GovernmentDocs page taken from a request submitted by CREW (Citizens for Responsibility and Ethics in Washington) regarding the Endangered Species Act. Each GovernmentDocs page has a unique URL, full text transcription of the page and supports comments and reviews. The possibility of building up a community around these records is very real. I am curious to see how many citizen reviewers and comments are associated with these documents a year from now.

Please help celebrate Sunshine Week by exploring all these amazing resources!

Google Tackles Magazine Archives

[Image: Google Book Search: Popular Mechanics Jan 1905 Cover Image]

As has been reported around the web today, Google is now digitizing and adding magazines to Google Book Search. This follows on the heels of the recent Google Life Photo archive announcement.

I took a look around to see what I could see. I was intrigued by the fact that I couldn’t see a list of all the magazines in their collection. So I went after the information the hard way and kept reloading the Google Book Search home page until I didn’t see any new titles displayed in their highlighted magazine section. This is what I came up with, roughly grouped by general topic.

Science and technology:

Lifestyle and city themed:

African American:

  • Ebony Jr!: May 1973 through October 1985
  • Jet: November 1961 through October 2008
  • Black World: Named ‘Negro Digest’ from November 1961 through April 1970, then ‘Black World’ from May 1970 through April 1976.

Health, nutrition and organic:

  • Women’s Health and Men’s Health: January 2006 through present. I found it very amusing to be able to scan the covers of all the issues so easily – true for all of these magazines of course, but funny to see cover after cover of almost identically clad men and women exercising.
  • Prevention: January 2006 through the present
  • Better Nutrition: January 1999 through December 2004
  • Organic Gardening: November 2005 to the present
  • Vegetarian Times: March 1981 through November 2004

Sports and the outdoors:

They of course promise more magazines on the way, so if you are reading this long after mid-December 2008, I would assume there are more magazines and more issues available now. I hope that they make it easier to browse just magazines. Once they have a broader array of titles – how neat would it be to build a virtual newsstand for a specific week in history? Shouldn’t be hard – they have all the metadata and cover images they need.

I love being able to read the magazine – advertising and all. They display the covers in batches by decade or 5 year period depending on the number of issues. I also like the Google map provided on each magazine’s ‘about’ page that shows ‘Places mentioned in this magazine’ and easily links you directly to the article that mentions the location marked on the map.

I think it is interesting that Google went with more of a PDF single scrolling model rather than an interface that mimics turning pages. In many issues (maybe all?) they have hot-linked the table of contents so that you can scroll down to that section instantly. You can also search within the magazine, though from my short experiments it seems that only the articles are text indexed and the advertisements are not.

Google’s current model for search is to return results for magazines mixed in with books in Google Book Search results – but they do let you limit your results to only magazines from their Advanced Search page within Google Book Search. See these results for a quick search on sunscreen in magazines.

Overall I mark this as a really nice step forward in access to old magazines. As with many visualizations, seeing the about page for any of these magazines made me ask myself new questions. It will be interesting to see how many magazines sign on to be included and how the interface evolves.

To read more about Google’s foray into magazine digitization and search take a look at:

For a really nice analysis of the information that Google provides on the magazine pages see Search Engine Land’s Google Book Search Puts Magazines Online.