“We often try to do multidisciplinary work, but it’s not often you work so closely with people from very different academic backgrounds”, says Dr Morgan Harvey, Senior Lecturer in Data Science at the Information School and Sheffield’s Principal Investigator on the AHRC-funded Bluejackets research project. Though the Civil War period of American history - the 1860s - is the context for this project, Dr Harvey says “it’s actually pretty much exactly a clean 50/50 split between the historical part of the project and the information science part.”
Along with Postdoctoral Researcher Dr Adam Funk in Sheffield and former Sheffield colleague Professor Frank Hopfgartner of the University of Koblenz-Landau in Germany, Dr Harvey is collaborating with historians Professor David Gleeson and Dr Damien Shiels at Northumbria University. Gleeson and Harvey lead the project as co-PIs, and the team is completed by Associate Professor Wayne Hsieh of the United States Naval Academy. The project is just under one year into its three-year duration.
The focus of the project is the hundreds of thousands of muster rolls released by the US Naval Academy Museum, documenting sailors and officers - known as ‘Bluejackets’ because of their blue uniforms - who were aboard Union vessels during the American Civil War. These rolls - essentially registers of every crew member aboard a given ship - were taken on a regular basis and are surprisingly detailed, recording when and where a sailor was born, their eye colour, hair colour, skin complexion, tattoos and previous occupations.
The project is split into four strands, one of which is ‘Machine Learning’. This is the domain of the team at the Information School in Sheffield, and their first task is digitising these paper records, which is no small undertaking.
“They’re written in mid-19th Century American hand, and they weren’t always very precise”, says Dr Harvey of the difficulties working with such old, handwritten documents. “These people weren’t creating records with the idea that 160 years later someone would be trying to do something useful with them!”
Using crowd volunteering platform Zooniverse, the team are recruiting volunteers on a rolling basis to get through the initial stages of this transcription process. The volunteers manually transcribe one column at a time from a photo of a specific muster sheet, drawing a box on the photo around the text they’ve transcribed. A minimum of five people look at each sheet, to account for inevitable discrepancies between what different people read in the handwriting. More than 650 volunteers have been recruited so far, but the team are aiming for thousands, with Dr Harvey and Dr Funk also getting involved themselves.
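One simple way to reconcile several volunteers' readings of the same entry is a majority vote. The sketch below is illustrative only and is not the project's actual aggregation method: the function name `consensus`, the 60% agreement threshold and the sample readings are all assumptions made for this example.

```python
from collections import Counter


def consensus(transcriptions, min_agreement=0.6):
    """Return (text, share) for the most common reading.

    text is None when the winning reading's share of the votes falls
    below min_agreement (an assumed threshold, not a project setting).
    """
    # Normalise case and whitespace so trivial differences do not split the vote.
    normalised = [" ".join(t.split()).lower() for t in transcriptions if t.strip()]
    if not normalised:
        return None, 0.0
    text, votes = Counter(normalised).most_common(1)[0]
    share = votes / len(normalised)
    return (text if share >= min_agreement else None), share


# Five made-up readings of one name cell, one per volunteer.
readings = ["John Murphy", "John Murphy", "john  murphy", "John Murphey", "Jno Murphy"]
result = consensus(readings)
print(result)
```

Entries that fail to reach the threshold could be routed to further volunteers or to an expert reviewer rather than being accepted automatically.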
Dr Funk has developed a piece of software that creates an image file of the box that each volunteer has drawn around the text they’ve transcribed, along with the transcription itself and which column it’s from. These images will be fed into a deep learning model, which is being developed as part of the project, to automate a large part of the transcription once the manual transcriptions have generated enough data to train the model.
“We think that the optical character recognition software we’re developing will do better at recognising things like age and height, where it’s just expecting numbers, than things like names”, says Dr Funk. “The joined-up handwriting is often very loopy, and letters sometimes run into the next row below them on the form.” Many of the enlisted naval men will have been illiterate, too, meaning they don’t spell their name consistently across different muster rolls, and the recording officers may have interpreted differently the name given to them verbally. There’s also the issue of choppy seas making handwriting even less legible than it would have been on land. The machine learning model will be trained separately on each column to try to account for these kinds of issues, and some probability theory will be applied for columns such as ‘place of recruitment’, where the text should refer to only one of a few options.
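For a column with only a handful of valid values, a noisy machine reading can be snapped to the nearest allowed option. The sketch below uses plain string similarity from Python's standard `difflib` module as a stand-in for the probabilistic approach the team describes; the list of places, the function name `snap_to_vocabulary` and the cutoff value are hypothetical.

```python
import difflib

# Hypothetical closed vocabulary for a 'place of recruitment' column.
PLACES = ["New York", "Boston", "Philadelphia", "Baltimore", "Portsmouth"]


def snap_to_vocabulary(raw, options=PLACES, cutoff=0.6):
    """Map a noisy reading to the closest allowed option, or None if nothing is close."""
    # Compare in lower case, but return the option with its original spelling.
    lookup = {option.lower(): option for option in options}
    matches = difflib.get_close_matches(
        raw.strip().lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None


print(snap_to_vocabulary("Bostin"))
print(snap_to_vocabulary("Phila delphia"))
print(snap_to_vocabulary("zzz"))
```

A fuller treatment might weight each option by how often it appears in the records, so that common recruitment places win ties over rare ones.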
The historical aspect of the project begins once transcription is complete. The aim is to identify individuals and link them between multiple records. This will allow the team to see if a given person has moved around between vessels over the course of the war, but also to link them to entries in other databases from the period, such as recruitment records, pension records and hospital records.
“The idea is to generate a searchable, transcribed list of every individual in every US naval vessel during the Civil War and link those to other digital records”, explains Dr Harvey. This will allow historians to write histories of common sailors, looking at things like race, ethnicity and class.
“Many of these records are the first time that emancipated, previously enslaved people have been recorded as individuals, rather than chattels of a white slave owner, so it’s quite significant”, says Dr Harvey. Roughly 30% of the US sailors were from the UK and Ireland - which was illegal then, as it is now - so there’s a local interest, too. Additionally, most historical records focus on higher ranking officers, rather than the working class enlisted men, so this project should help address the issue of underrepresentation of these people.
Thanks to the data science and machine learning groundwork of the project being laid here in Sheffield, historians will be able to do this kind of analysis en masse.
“Typically, when historians do this kind of work, they might pick a few people and try and trace them through the different records”, explains Dr Harvey. “This project will allow them to look at the progression through time of tens of thousands of people and really look at demographics in a way that they couldn’t before.”
Though the linking of individuals in the records uses more established data science methods than the transcription, and will be using standard ASCII text rather than 19th century handwriting, it’s still not a wholly straightforward task.
“This would be much easier these days, as everyone has a National Insurance number or Social Security number”, says Dr Harvey of the challenges. “No such things existed during the Civil War period, so it’ll be harder to uniquely identify people. I suppose that’s a good thing, though, otherwise there wouldn’t be a project!” There are other issues with incomplete records, where an officer was clearly in a rush and skipped some columns. It’s also very hard to find consistency in columns like complexion or skin colour; words like “florid” and “swarthy” are used, as well as some distasteful and offensive words we’d never use today, none of which are applied uniformly.
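Without a unique identifier, linking usually means scoring how alike two records are across several imperfect fields and tolerating gaps. The sketch below illustrates that general idea only; it is not the project's method. The field names, the weights, the one-year tolerance on birth year and the sample records are all invented for the example.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity between two names, from 0.0 (unrelated) to 1.0 (identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(rec_a, rec_b):
    """Score from 0 to 1 for how likely two records describe the same person.

    The weights (0.6 name, 0.2 birthplace, 0.2 birth year) are illustrative
    assumptions. Missing fields simply contribute nothing to the score.
    """
    score = 0.6 * name_similarity(rec_a["name"], rec_b["name"])
    place_a, place_b = rec_a.get("birthplace"), rec_b.get("birthplace")
    if place_a and place_a == place_b:
        score += 0.2
    year_a, year_b = rec_a.get("birth_year"), rec_b.get("birth_year")
    # Allow one year of slack, since ages were often recorded loosely.
    if year_a is not None and year_b is not None and abs(year_a - year_b) <= 1:
        score += 0.2
    return score


# Made-up records: a and b differ only by a spelling variant and one year.
a = {"name": "Patrick Kelly", "birthplace": "Ireland", "birth_year": 1840}
b = {"name": "Patrick Kelley", "birthplace": "Ireland", "birth_year": 1841}
c = {"name": "William Jones", "birthplace": "Wales", "birth_year": 1829}

print(round(match_score(a, b), 3), round(match_score(a, c), 3))
```

Pairs scoring above a chosen threshold would be treated as candidate links, with borderline cases left for a historian to review.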
Aside from the machine learning strand of the project covered in Sheffield, there are three specific strands being looked at by the historians at Northumbria. One is about race and ethnicity; what was the makeup of the sailor population, and how did it change? African-American sailors only appear in the last few years of the Civil War, after the emancipation of slaves, for example.
Another strand looks at class; the occupational background of sailors, and how this related to their place of origin, their race, and the rankings on the ship. Said ship ranks are much more specific on these muster rolls than they are on modern vessels, with some ranks describing exactly what a person did, such as ‘Coal Hauler’, or others like ‘Landsman’ simply describing someone with no sea experience.
“There are some sailors whose ranks are just listed as ‘Boy’!”, says Dr Funk.
“There’s also ‘Senior Boy’!”, adds Dr Harvey.
The final strand of the project is ‘transnational’; how much did the US Navy rely on foreign-born people, such as those from the UK and Ireland? We know even less about other European countries, or other British colonies like Canada and countries in the Caribbean. How did US naval recruitment from these places compare to recruitment to those countries’ own navies? Some recruits would enlist to take advantage of a bounty payment which was offered to boost numbers, and then desert the navy to enlist elsewhere for another payment. One reason why the muster rolls were so detailed, including things like tattoos, was to try and identify people and stop this happening.
“How they could do that with this record system I’m not sure!”, says Dr Funk.
“For someone with a background in information and data science, the seeming lack of thought that’s been put into designing these records is quite amazing!”, says Dr Harvey. “You just wouldn’t design things like this if you ever planned to use them to look things up. It does make it quite fun, though!”
The project came about through Dr Harvey’s previous employment at Northumbria University. Once Harvey moved to Sheffield, Gleeson contacted him to ask if he was interested in a project about the American Civil War - a proposal for which he was fortuitously primed as a child.
“Serendipitously, for whatever reason, my Dad has a strong interest in the Civil War”, Dr Harvey explains. “I must be one of the few British people who had seen the film Gettysburg and its sequel Gods and Generals by about the age of 10 - both of which are incredibly long!”
Dr Funk’s interest in the project is close to home in a different way.
“I’m from Virginia, which is where many Civil War battlefields are”, he says. “The famous Battle of Hampton Roads took place in the James River estuary in Virginia.”
Though there is some precedent for using machine learning in research on old, handwritten text, Dr Harvey believes that this project is still quite unusual.
“Collaboration between historians and data scientists or machine learning experts is very rare”, he says. “It’s pretty novel to be applying these kinds of methods to this kind of data.”
Dr Harvey also talks about having to explain concepts to his historian colleagues that he never has to explain in the information- and data-focused world of his job as an academic at the Information School - another interesting aspect of a project this interdisciplinary.
“In a way, you could call this a Digital Humanities project”, adds Dr Funk. “In literature research, they use Machine Learning to do author identification in a corpus of texts, and they find that you can get similar results to those that humanities scholars would get through traditional methods, but they can do it efficiently at a large scale. That’s what we’re trying to achieve with this project, too.”
The research team are planning three historical publications and a monograph as the outputs of the project. The Civil War Sailor Internet Resource - the name for the searchable database of records mentioned earlier - will be open access, available to anyone at the end of the project. This will be launched with a conference at the US Naval Academy Museum in Annapolis, tying into US Black History Month in February 2025. There will be a second public launch at Howard University in Washington DC - a historically black university with origins in the Civil War. Finally, a launch in Northumbria will highlight the British angle to the project.
The other databases to which the records on the Civil War Sailor Internet Resource will link may not be free, but most are owned by ancestry.com, to which most historians and genealogy enthusiasts have access already. With genealogy being such a huge interest these days, the potential reach of this project’s results is vast.
“For many African American people in particular, if they look back into their genealogy, at a certain point the records just stop”, says Dr Harvey. “If we can push those records back even a little bit further then that’s a great contribution.”
There’s also a surprising amount of interest in the American Civil War in the UK.
“A few years ago I went to a festival at Norfolk Heritage Park in Sheffield”, says Dr Funk. “There was an American Civil War reenactment society there that was big enough that they had a cannon to fire!”
There are so many interesting individual stories emerging from these thousands of muster records that the team have started highlighting them as they are discovered. Dr Harvey and Dr Funk continue to find interesting items themselves, too.
“Morgan and I would have been the tallest people on any of the ships we’ve looked at so far!”, says Dr Funk. “The heights we’ve seen tend to be in the 5’ to 5’10” range”.
By applying machine learning techniques to a vast set of historical data and working across humanities and social sciences, the Bluejackets project will deliver meaningful, usable data for use not only by the historians on the project itself but any number of future researchers and amateur historians. With detailed information on race, class and other such demographics, the possibilities for future findings in these important and impactful domains using this data are extensive, and this project is a testament to the value of truly interdisciplinary research.