Defensible standards-based web and social media archiving for compliance, eDiscovery and corporate heritage; SaaS and on-premise solutions Hanzo specialize in building cutting edge web archiving technology, since mid-2014 I've worked with them to help build a variety of products and tools to improve their ability to crawl and playback archives as well as giving their customers better insight into what has been captured.
A key market for Hanzo is litigation, where law firms need to able to prove what was or wasn't said by a company or individual at any point in time. A daily capture is a good way to capture this data, but identifying changes would be a long and manual process.
The key takeaways from the exercise were that we could get quite a long way with word counts, hashing and page counts; while there'd still be a lot of noise, which tends to be the norm in web capture, we could demonstrably detect changes between captures, enough so that it'd be easier for the end-user to find what they were looking for.
After the change detection project we were able to identify changes between captures, but we needed to be able to display that to the end user. The initial brief was to perform image differentials, which I did try, but had pretty low quality results.
Hanzo populate a Solr index as part of their crawl strategy, we integrated this in a way that would allow users to search across time with pages grouped by URL and ungrouped. Text extraction and tagging meant that finding the captured content you're looking for worked fairly well.
Request a capture
Web archiving is complicated to say the least; Hanzo have a whole team of crawl engineers who specialise in the art of performing a successful web capture, there's lots of knobs and levers which for the most part lived in command line arguments until the web application was born.
Request a capture is a project Hanzo drafted to put these powers in the hands of their end users. Working closely with their CTO and Development Manager I helped them crystallise the options that were available and how we'd present them to end-users in a way that they'd understand.
The result was a seemingly simple form with a lot of power. Users specify the capture type, be that a standard website, or a Facebook or Twitter page, and that would govern which crawl driver is used, and also be able to supply settings for how many hops to go off-domain, or whether to include or exclude certain URLs.
Steve has an innate technical ability which is given further strength by his level of experience and general desire to do a good job. I have enjoyed working with Steve over and above many others because of his ability to gauge when a problem needs to be solved right and when a problem needs to be solved quickly. He always strives to do an excellent job regardless and makes the right engineering decisions. He learns new technology at an exceptional rate and is often the go to person to solve a technical challenge. As a team player, Steve appears equally happy to integrate with an existing team and role his sleeves up directly working on a project, help mentor less experienced developers, or lead a project as a technical oversight. I have always been happy taking Steve in to customer meetings safe in the knowledge that he will answer questions in a composed and accurate manner whilst not being afraid to give his opinion on technical issues. Steve is well liked and highly respected by his colleagues from all areas of the business. I will miss working with Steve and am confident in recommending him for any technical position.
Steve's been a strong contributor to our product software giving excellent advice and guiding our design to ensure a that we give our customers a good user experience. His coding has been top notch and he’s helped manage other team members to ensure that all aspects of the delivery have been what we need.