Runaway Project Overview

Ok, for newspapers that match an existing one in your system I'll tag them with the correct ID; for new newspapers I'll try to grab your requested fields from elsewhere in the book so you can assign a new ID.

From: Brandon T. Kowalski <kowalski@cornell.edu>
Sent: Thursday, April 23, 2020 9:41 AM
To: Eric Anderson, CFA <eanderson@skylarcap.com>
Cc: Edward Eugene Baptist <eeb36@cornell.edu>; Bill Perkins <BPerkins@skylarcap.com>
Subject: Re: [External] Runaway Project Overview

Hi Eric,

For all Advertisements we need information about the newspaper (3 fields: newspaper name, publication city, publication state).

You can safely record this once and reference it each time this newspaper is used.

In addition to the newspaper info we have a handful of fields we request on initial import. These can be provided in a spreadsheet.

Filename of Image (relative path please)

Publication Date (ISO 8601, please: YYYY-MM-DD)

Page Number

Full Text Transcription

Of these four fields, publication date is always required.

If the advertisement is coming from a book, like the PDFs Ed sent over, then a Filename is not expected but a Full Text Transcription is required.

If the advertisement has a scan then the FTT is optional.
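If it helps, here is a rough sketch (in Python, purely for illustration) of how such an import spreadsheet could be laid out. The column headers and the two sample rows are placeholders, not our exact schema, but they reflect the rules above: publication date always present, a relative image path for scanned ads, and a full text transcription when the ad comes from a book instead of a scan.

  import csv

  # Illustrative only: column names and sample values are placeholders, not FOTM's schema.
  columns = ["newspaper_name", "publication_city", "publication_state",
             "image_filename", "publication_date", "page_number",
             "full_text_transcription"]

  with open("runaway_ads_import.csv", "w", newline="") as f:
      writer = csv.writer(f)
      writer.writerow(columns)
      # Scanned ad: relative image path given, transcription optional.
      writer.writerow(["Charleston Courier", "Charleston", "SC",
                       "images/charleston_courier/1832-06-14_p3_ad02.jpg",
                       "1832-06-14", "3", ""])
      # Book-sourced ad: no image file, so a full text transcription is required.
      writer.writerow(["Charleston Courier", "Charleston", "SC", "",
                       "1832-06-21", "2", "RUNAWAY from the subscriber ..."])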

Hopefully this breakdown makes sense. Please let me know if you need me to clarify anything.

~btk

On Apr 23, 2020, at 10:28 AM, Eric Anderson, CFA <eanderson> wrote:

Thank you for sending these; I'll take a look at breaking them up into individual ads. Should I include any other data points besides the newspaper name, newspaper ID, and publication dates?

My priority right now is cleaning the existing images. I am also working on some AI feature extraction and will share updates as I make progress. We are also putting together a website that will aggregate all Runaway Project notes and correspondence for easy reference by your team; I'll share those details once it's up.

From: Edward Eugene Baptist <eeb36>
Sent: Wednesday, April 22, 2020 6:58 PM
To: Eric Anderson, CFA <eanderson>
Cc: Bill Perkins <BPerkins>; Brandon T. Kowalski <kowalski>
Subject: Re: [External] Runaway Project Overview

These are great! I’m sharing a link for the pdfs from the books that I mentioned to you earlier. Let me know how these look—they contain several thousand more ads.

https://drive.google.com/open?id=18IHljQo563f5NBdmmAgI1mTwA7QVC95l

Great talking to you today. I’m cc’ing Brandon Kowalski, who’s our lead developer.

Ed

Edward E. Baptist

Professor, Department of History

450 McGraw Hall

Cornell University

Ithaca, NY 14853 USA

Freedomonthemove.org

From: Eric Anderson, CFA
Sent: Wednesday, April 22, 2020 4:31 PM
To: Edward Eugene Baptist
Cc: Bill Perkins
Subject: RE: Runaway Project Overview

Here is a seasonality chart with monthly instead of weekly granularity; it does a better job of showing the lag between Jailer and Enslaver ads:

<image001.png>

From: Eric Anderson, CFA
Sent: Wednesday, April 22, 2020 2:31 PM
To: ‘eeb36’ <eeb36>
Cc: Bill Perkins <BPerkins>
Subject: Runaway Project Overview

Hello Edward, as discussed here is an overview of our process along with the data we have cleaned so far.

<image002.png>

-Our scraper has pulled your ads, events, runaways, newspapers, and other tables and descriptors from your website and stored them on our SQL server. I will add an additional process, still to be developed, that separates the new ads you are going to send us out of the PDFs.

-Ads are first sent through the Google Vision API (https://cloud.google.com/vision) to get a first draft transcript.

-The first draft transcript is sent to Amazon Mechanical Turk (https://www.mturk.com/) for revisions into two distinct final draft candidates.

-The Best Transcription Selector is a program that analyzes each final draft candidate and picks the one to be used as the final draft.

-Once an ad_id has a final draft attached it is run through the Event Generator which uses cosine similarity to match the ad to an existing group of ads or create a new event if no match is found.
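For reference, the core of the Event Generator matching step looks roughly like the sketch below. This is a simplified illustration, not the production code: the 0.80 cutoff is a placeholder rather than our tuned threshold, and the real version caches the vectorizer instead of refitting it for every ad.

  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.metrics.pairwise import cosine_similarity

  def assign_event(new_text, events, threshold=0.80):
      """Match a final-draft transcript to an existing event, or signal a new one.

      events: dict of event_id -> representative transcript text.
      Returns an event_id, or None when the caller should create a new event.
      """
      if not events:
          return None
      event_ids = list(events)
      corpus = [events[e] for e in event_ids] + [new_text]
      tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)
      n = tfidf.shape[0]
      sims = cosine_similarity(tfidf[n - 1], tfidf[: n - 1]).ravel()
      best = int(sims.argmax())
      return event_ids[best] if sims[best] >= threshold else None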

Ideally I would like to use an AI for feature extraction from the ads, but for any feature or ad where the AI fails to deliver what we want we will use MTurk. I've attached what ad cleanup we have so far. Ad_ids are the same as yours, so you can join with your database if you want to pull further details; for this query I kept the columns simple.

There are 27,420 ads. Where your transcription is available it is populated in 'fotm_transcript'; the remaining ~15k ads will all be run through the process I described above. 'transcript_vision' contains the Google Vision API first drafts; some are pretty good, others need more work. 'lang' is the language detected by Google; we are focusing on the English ads for now and will visit the French ads next. 'transcript_turk' contains the final draft after going through MTurk. There are about 1k of these, and another 1k are done but their batches on MTurk have not completed yet, so as soon as those finish I'll download them and send an update.

About 1k ads will need new images due to inadequate size. You can identify these as the ads with missing data in the fotm_transcript, transcript_vision, and transcript_turk fields, or by multiplying img_px_w and h and filtering by < 20,000. Event_id is our unique identifier, generated based on ad similarity.
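If you want to reproduce the "needs a new image" flag on your side, it is just this filter over the attached sheet (a sketch assuming the attachment is loaded into pandas; the file name is a placeholder, and I'm assuming the height column is named img_px_h):

  import pandas as pd

  ads = pd.read_csv("runaway_ads_export.csv")  # placeholder name for the attached sheet

  missing_all_transcripts = ads[["fotm_transcript", "transcript_vision",
                                 "transcript_turk"]].isna().all(axis=1)
  too_small = (ads["img_px_w"] * ads["img_px_h"]) < 20_000  # assumes img_px_h is the height column

  needs_new_image = ads[missing_all_transcripts | too_small]
  print(len(needs_new_image), "ads flagged for re-imaging")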

Here are some of the charts discussed in the call:

Anomalous number of events in 1828 and 1854

<image003.png>

Runaway event seasonality by min publish date.

<image004.png>

Obviously a larger sample might change some of these observations.

I’m looking forward to continuing work on this project. I’ll keep you in the loop as I make progress; please reach out at any time if you have questions, and don’t hesitate to give me your candid feedback.

Thanks,

Eric Anderson, CFA

Quantitative Analyst

Skylar Capital Energy Global Master Fund LP

(713) 341-7985 work

(281) 606-9371 cell

Runaway Project Start

I was asked to work on an AI that will pull data out of these ads and present some data and analysis.

Questions to be answered:

How many?

By year

By state

What time of year

Male /female

How many mulatto/mixed (insight into rape)

Any other signal you may be able to get.

I would be honored to work on this project.

Here is how I see the project working:

It looks like some of the articles are already transcribed and some are not. Assuming there is a sufficient sample of transcribed articles, I will only need to scrape the data, load it into a database, and process it into a model; otherwise I will need to add a transcription step to the plan below. There are some transcription services through Google that might work, depending on the quality of the images (a rough sketch follows the outline below).

  1. Data collection
    1. Scrape data and relevant characteristics and load into a database
  2. Create Training Dataset
    1. Select a sample of articles, add data labels (Amazon Turk?)
  3. Build Proof of concept model
    1. Share results and get feedback
  4. Set final model
  5. Refine and expand model to reveal more characteristics & improve accuracy.
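Here is roughly what the Google option would look like per image (a sketch using the google-cloud-vision Python client; it assumes the library is installed and application credentials are configured, with batching and error handling mostly omitted):

  from google.cloud import vision

  def transcribe_ad(image_path):
      """First-draft OCR of one ad image via the Google Vision API (sketch)."""
      client = vision.ImageAnnotatorClient()
      with open(image_path, "rb") as f:
          image = vision.Image(content=f.read())
      response = client.document_text_detection(image=image)
      if response.error.message:
          raise RuntimeError(response.error.message)
      return response.full_text_annotation.text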

Sector based declines

I’ve reviewed the data and started building the scraper that will database it all. It looks like a good number of ads are already transcribed, and all of the transcribed ads also have summary statistics associated with them that satisfy the features you outlined (how many, year, state, time of year, etc.). So my next step after databasing is to generate analysis and views of the existing transcribed data, to get a better understanding of it and to decide whether any more features are needed before developing the AI to transcribe and extract features from the remaining ads. I'll send that analysis to you for review to make sure I'm on the right track before getting too far into the model building.

Checking in on slave project

I have all the advertisements databased and a scraper in place to gather new data from the site as it is posted. I have also begun exploring the data and tagging relevant features that I will want the AI to extract from the transcripts. A couple of interesting things I've found in the data so far:

There is a strong seasonality to runaway events:

There are a couple of exceptional years where runaway events occurred at a high rate: 1828 and 1854. 1854 could be an outlier year because of the tense political climate that began then (the start of Bleeding Kansas). I'm not certain why 1828 would stand out, but it could be the Tariff of 1828, which hurt the southern economy and may have created more opportunities for escape attempts.

I expect to have more summary analysis for you this week which will be based on the portion of the advertisements from the site that have complete features listed. I will also start building some proof of concept models and begin testing the accuracy of AI at extracting certain features.

Yes, the ads are unique; however, there are two kinds of ads:

  1. Posted by enslaver, searching for a runaway
  2. Posted by a jailer after a runaway was captured

Ideally we would like to try and pair the two to eliminate the possibility of two ads appearing for a single runaway event, but the ads themselves are scant on details, so trying to match them up is really just guessing. I think the best way to proceed is to treat both groups as separate samples of the total ground-truth dataset, with the understanding that the groups may overlap.

Something else I was thinking about is that a single ad poster may run their ad in multiple newspapers; I will check for that based on transcript similarity and try to eliminate any duplicates.

Also, our data could be biased by the survivorship of newspaper archives. What proportion of newspapers survived from the 19th century, and for which regions? Answering these questions will require some expertise in the availability of historical documentation from that period. Without this knowledge we may mistake survivorship bias in the newspapers for trends in the ads.

We can at least color-code jailer versus owner ads in the charts.

We can compile, or at least estimate, the total number of newspapers available by year versus the sample we have, and then use that ratio to arrive at an aggregated estimate of advertised runaways.
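As a rough illustration of what I mean (all the numbers below are made-up placeholders; the real survival counts are exactly what we would need a historian for):

  # Scale observed ad counts by estimated newspaper coverage, per year.
  observed_ads = {1828: 950, 1829: 400}     # ads in our sample (placeholder values)
  sampled_papers = {1828: 60, 1829: 55}     # newspapers represented in our sample (placeholder)
  total_papers = {1828: 240, 1829: 230}     # papers believed to have been in print (placeholder)

  for year in observed_ads:
      coverage = sampled_papers[year] / total_papers[year]
      estimated_total = observed_ads[year] / coverage
      print(year, round(estimated_total))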

Looking to infer the prevalence of rape via keywords like 'mulatto', 'light skin', etc. in the ads. I could do a plain text search, but those tend to be inaccurate ('it was light outside when the runaway occurred' vs. 'the runaway had light skin'), so I will probably want to use an AI to find these.
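To show why a plain keyword search falls short here, the same two example sentences both match a naive search:

  import re

  pattern = re.compile(r"\blight\b", re.IGNORECASE)

  # Both match, but only the second describes the runaway's complexion,
  # which is why a context-aware model is needed rather than keyword matching.
  print(bool(pattern.search("it was light outside when the runaway occurred")))  # True
  print(bool(pattern.search("the runaway had light skin")))                      # True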

Type of ad is something the AI will need to determine before we can proceed with high-level analysis, because right now >80% of the ads have not been categorized:

I will make ‘advertisment.type’ my first proof-of-concept AI model; that way we can start high-level analysis while I continue work on the other features.
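The POC will be along these lines; the sketch below is one simple way to set it up (TF-IDF features plus logistic regression), not necessarily the final architecture, and the file and column names are placeholders for wherever the hand-categorized ads end up:

  import pandas as pd
  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import train_test_split
  from sklearn.pipeline import make_pipeline

  # Placeholder input: the ads that already have a hand-entered type.
  labeled = pd.read_csv("labeled_ads.csv")  # assumed columns: transcript, ad_type

  X_train, X_test, y_train, y_test = train_test_split(
      labeled["transcript"], labeled["ad_type"], test_size=0.2, random_state=0)

  model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2),
                        LogisticRegression(max_iter=1000))
  model.fit(X_train, y_train)
  print("holdout accuracy:", model.score(X_test, y_test))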

I added a new axis showing the distinct number of newspapers per year, to hopefully explain some outlier ad years in terms of newspaper availability. Of course, this is all coming from the same dataset, so it is possible a newspaper existed in a year but did not publish any runaway ads; in that case it would not show up in our count even if there is a historical record.

About 20% of the ads are 'completed', which basically means that a lot of the features we want have already been extracted by hand. This is good because I will use them as training data for the AI model. There is a field called 'racial_description'; here are the top 15 responses for that field. I will need to simplify these categories for our model prior to training (probably into three categories: 'lighter', 'darker', 'average').

Also, I thought it was interesting that (for the completed ad sample) the average runaway is young (25.72 years old) and male (75%).

I’ve completed a proof-of-concept version of the ad type classifier AI model. I say 'proof of concept' because a lot of the development time for these models goes into hyperparameter tuning, basically fine-tuning that squeezes out additional accuracy. This can be very time consuming, so POC models are a good place to establish a method before making that time investment.

This POC model produced an accuracy of ~96% and classified the 27k ads as 93% enslaver and 7% jailer.

I’m going to start working on demographic POC models now, starting with multiple-vs.-single runaway classification and then going on to gender, racial_description, and age.

Here are some previous charts I sent you but with the new ad classification:

I finished constructing the multiple-vs.-single runaway classification AI; the accuracy for the POC model is 94%.

The single runaway ads will be pushed directly into an AI that will return our desired demographic data. The multiple runaway ads will need to be split into relevant bodies of text that contain the descriptors for each runaway before being pushed into the same AI. We can then aggregate and study the results.

As I have started constructing this system I am coming across some data quality issues that need attention.

-First of all, many of the transcriptions are effectively missing. These ads slipped past my original criteria because their transcription fields are populated, but they contain things like “[illegible]” or “NAN”.

-Some of the ads have been entered more than once even though they come from the exact same image; interestingly, their transcriptions contain differences that defeat simple duplicate text detection.

-Some ads are published in multiple newspapers (as we discussed), but their transcriptions differ because some sentences have been slightly reworded, perhaps by the editor or printer.

Here is how I plan to correct these data quality issues:

1. Begin work on a transcription program to process the ads with missing/incomplete transcriptions.

2. Create an algorithm that compares original transcription images and tags duplicates.

3. Create an algorithm that uses Cosine Similarity to detect rephrased duplicate ads.

The results from this cleaned data will be databased and then used in all training and analysis. Unfortunately, this data cleaning will need to be done prior to any more analysis progress. I will prioritize items 2 and 3 so that we can at least perform analysis on most of the ads before finishing item 1.
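For item 1, flagging the ads that need re-transcription is straightforward once the placeholder strings are known (a sketch; the placeholder list and the short-text cutoff below are guesses I will refine as I find more cases):

  import pandas as pd

  PLACEHOLDERS = {"", "nan", "[illegible]"}  # grows as I find more stand-ins for real text

  def is_placeholder(text):
      """True when a 'populated' transcription field is effectively empty (sketch)."""
      if pd.isna(text):
          return True
      cleaned = str(text).strip().lower()
      return cleaned in PLACEHOLDERS or len(cleaned) < 20  # 20-char floor is a guess

  ads = pd.read_csv("runaway_ads_export.csv")  # placeholder file name
  ads["needs_retranscription"] = ads["fotm_transcript"].map(is_placeholder)
  print(ads["needs_retranscription"].sum(), "ads flagged under item 1")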

There are 27k ads; right now about 3k of them are what I have been using for training. I would estimate that about 50% of the 27k ads will need to be re-transcribed. I'm still working on the duplicate detection algorithms, so I'll get back to you with those figures.

Yes, I'll create a log of errors so we can share it with the Freedom on the Move people. Do you have any contact with them? If so, I would love to have a call with them and get a little more background on where the data comes from and how they are generating their transcriptions (a program? crowdsourced?). Maybe they would be willing to share some resources with me?

Mechanical Turk Prepaid HITs

The 1,000-ad test batch on MTurk is now live. I'll be monitoring the results as they come in, but if you want to monitor them yourself let me know and I'll send you the account credentials via Signal.

This is what a sample task looks like:

Review of Amazon Turk Results

Of the 3,000 jobs (1,000 ads, each done 3 times), no changes whatsoever were made to 2,068 of them. This could mean that the Google transcription was just that good; however, on most of the ads where no changes were made the Turks clicked submit within 10 seconds or less, which is not enough time to even review the image and transcription. To me this says that these Turks were just hoping to slip through any quality control and get paid for doing nothing. Some spot checks of these ads confirm this. Below is a histogram of the time taken on each job when no changes were made.

Of the ads where changes were made, the Turks spent much more time (an average of over 2 minutes). I will manually review all ads where less than 30 seconds was spent on the task, and I will spot-check a sample of the ads where changes were made. Below is a histogram of seconds spent on jobs where changes were made to the transcripts.
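The time-based checks are easy to reproduce from the batch results file (a sketch; I'm assuming the standard MTurk batch results CSV, which includes a WorkTimeInSeconds column, and the Input./Answer. field names depend on how the HIT template is defined):

  import pandas as pd

  results = pd.read_csv("Batch_results.csv")  # placeholder file name

  # Field names after "Input." and "Answer." are template-dependent placeholders.
  unchanged = results[results["Answer.transcript"] == results["Input.draft_transcript"]]
  suspicious = results[results["WorkTimeInSeconds"] < 30]  # flag for manual review

  print(len(unchanged), "assignments with no edits;", len(suspicious), "done in under 30 seconds")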

We will be refunded for all the unacceptable results.

I’m going to make a few changes and would like to resubmit the batch for another test.

  1. Limit the jobs to only highly rated Turks.
  2. Add a bold warning near the submit button that all transcription errors must be fixed or the task will be rejected.
  3. Increase the allotted time to 20min per job.

The batch is ~23% complete with an estimated completion time of tomorrow at midnight.

My spot checks on this batch show a much higher level of quality, very acceptable work. The average Turk is now spending 6 minutes reviewing and correcting each ad:

Here is one example of the quality of work.

Google API Transcript:

“ROKE JAIL-On the night of the 3d inst. a Negro Fellow by the name of TOM, about 6 feet high, and about 2 yearn of dge, very slout made, complexion not very blnck, Aud a very likely Negro; recently from Augunta, Geo.. Ala, a mulatio Fellow by the name of JESSE, about 5 feet 10 inches high, and about $3 yearn of age, a barber-hy trndr, and a stout well builu fetlow, large whiskers, and is vers intelligent; he isalao from Auguste, Geo. A liberal reward will be paid for both, or either of thém, by lodging tliem in the Charlenton Jail. B" J. TOBIAS, Jailur, C. D. Jan 5”

Turk Transcript Revision:

“BROKE JAIL-On the night of the 3d inst. a Negro Fellow by the name of TOM, about 6 feet high, and about 32 years of age, very stout made, complexion not very black, And a very likely Negro; recently from Augusta, Geo. [illegible], a mulatto Fellow by the name of JESSE, about 5 feet 10 inches high, and about 33 years of age, a barber by trade, and a stout well built fellow, large whiskers, and is very intelligent; he is also from Augusta, Geo. A liberal reward will be paid for both, or either of them, by lodging them in the Charleston jail. J. TOBIAS, Jailor, C.D. Jan 5 3”

Runaway Update

I’ve finished reviewing the completed results of our latest MTurk batch. I'm happy with the quality, but before we run another batch I would like to run what we have all the way through our data process, that is, take these transcriptions all the way through to completed AI model outputs of summary data about them (preliminary AI models). This is a test of my proposed process; if I find that something is not what I expect, that might change how I want the Turks to do future transcriptions. Once the process is dialed in, bringing in additional ads will be straightforward and fast.

Ovals represent processes; squares represent SQL tables.

-Scraper pulls ads from the FOTM website and stores them in the run_ad table. (100% done)

-Ads with no transcription are run through the Google Vision API to generate a draft transcript. (100% done)

-The draft transcript and image are run through MTurk for transcription edits which are stored in run_ad_processing. (process is 100% done, only ~10% of ads have been run through MTurk)

-Because there are multiple transcriptions per ad, the Best Transcription Selector uses AI to pick the best one and records the selection in run_assignment; a rough baseline sketch follows this list. (20% done with development)

-The best transcription per ad is processed through event generation, which groups ads into events. For example, one ad may run for multiple days, and sometimes in multiple newspapers; all of these ads must be grouped together into a single event to avoid over-counting. (20% done with development)
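As a point of reference while the AI selector is still in development, a simple baseline is to pick the candidate that agrees most with the other candidates. The sketch below is that baseline, not the AI selector itself:

  from difflib import SequenceMatcher

  def pick_best_transcription(candidates):
      """Baseline: return the candidate most similar, on average, to the others (sketch)."""
      if len(candidates) == 1:
          return candidates[0]

      def avg_similarity(text, others):
          return sum(SequenceMatcher(None, text, o).ratio() for o in others) / len(others)

      return max(candidates,
                 key=lambda c: avg_similarity(c, [o for o in candidates if o is not c]))

For example, pick_best_transcription([turk_draft_1, turk_draft_2, vision_draft]) would return whichever text the other two mostly agree with (the variable names here are just illustrative).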

I can send you updates as I develop and refine the process; once it's all done we have 10,768 additional ads we can run through.

Let me know your thoughts.

Runaway Project Update

As mentioned, I have completed development of the transcript selector as well as the event generator. Event generation is particularly important because ads are run in a series, and oftentimes in multiple newspapers, spanning up to nine months for some of the longer-running ads. Without accurate event generation you will overrepresent the same runaway event multiple times, creating unusable results. Originally I relied on the FotM event IDs but have since discovered that these are terrible; our event generation instead measures the cosine similarity between the text of every pair of ads to find matches, and in my checks I have found it to be extremely reliable. With that framework now in place and about 13k reliable transcripts, we have mapped out ~8k distinct events. From these events I have generated the following updated charts:

As you can see, 1828 and 1854 still stand out as outliers, but not to the extent that they did with the FotM event IDs. Seasonality remains a factor as before, and with a better jailer sample we can see that jailer seasonality lags enslaver seasonality, as you would expect from the lag between runaway events and capture events.

I’m continuing to develop new AI POC models on this data, with mixed results. The techniques I'm using to design the models come from other text-based projects I've worked on in the past, but none of those had text inputs this long (the last text AI I built was for LNG ship stated destinations, and those were only 90 characters). My POC models for things like Jailer vs. Enslaver ad types and single vs. multiple runaways work pretty well using these techniques, but other models I've built that try to extract specific information from the ad, like the amount of the reward offered, don't seem to work as well as I expect they could.

We could always just brute-force the problem with MTurk, but before we resort to that I would like to give some more advanced AI techniques a try. I would like to invest some time in learning some of the proven techniques for AI processing of unstructured text. This time investment (probably a couple of weeks) will unfortunately slow the project down, but in the long run it gives me additional AI expertise for my toolbox, and in the near term it makes this project cheaper, more flexible (we can easily tweak the AI to give us new features, whereas doing the same through MTurk would require an entirely new batch to be run on the entire dataset), and easier to scale. In my opinion this is the biggest benefit of doing things in-house as opposed to outsourcing: we develop new skills and talents that we can apply elsewhere. In this case I could see this new knowledge being applied in the future to something like a pipeline notice project.

While I’m developing these more advanced models we can send the remaining ~14k ads to MTurk for transcription. I don't think 3 jobs per ad will be necessary; 2 is enough. This will cost about $14k to run.

Let me know your thoughts.
