Runaway Annotation Batches

Of the 100 text annotation batches, 86 were of acceptable quality and the other 14 were not. The 14 contained fairly basic mistakes, and I suspect those Turks did not bother watching my instruction video. I changed the Turk survey to say that watching the video is required before attempting a job, and I included a list of common mistakes that will result in rejection. I kicked off the 1,000-ad batch last night, and it is 93% complete right now. I set the compensation at $0.30/job because I figured this annotation is easier than transcription, for which we paid $0.50. I did get one complaint email from a Turk who did a lot of good transcription work for us before, and I can see that the average job is taking over 5 minutes, so I might raise the compensation in future batches.
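
For a rough sense of the pay rate behind that complaint, the arithmetic can be sketched as below. The function name is only illustrative, and since jobs are taking over 5 minutes, the figure it gives is an upper bound on hourly pay, not a measured value.

```python
def hourly_rate(pay_per_job: float, minutes_per_job: float) -> float:
    """Effective hourly pay for a worker completing jobs back to back."""
    return pay_per_job * 60.0 / minutes_per_job

# $0.30 per job at 5 minutes per job; longer jobs mean a lower rate.
annotation_rate = hourly_rate(0.30, 5.0)
print(f"${annotation_rate:.2f}/hour at most")
```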

Since the app is connected directly to my SQL server, I can query the results in real time as they come in, unlike the transcription jobs, where I had to download them. I’m going to start building the AI model and connect its training input directly to the Turk results. That way, if I want more training data, I just kick off another batch and the data flows automatically into the AI model as the Turks work.
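
A minimal sketch of that live feed is below. It uses Python's built-in sqlite3 as a stand-in for the SQL server, and the table name, column names and sample rows are all placeholders, not the real schema. The idea is just that the training step asks for rows newer than the last one it has seen.

```python
import sqlite3

def fetch_new_rows(conn, last_seen_id=0):
    """Return annotation rows with id greater than last_seen_id, oldest first."""
    cursor = conn.execute(
        "SELECT id, ad_text, label FROM annotations WHERE id > ? ORDER BY id",
        (last_seen_id,),
    )
    return cursor.fetchall()

# In-memory stand-in for the results database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE annotations (id INTEGER PRIMARY KEY, ad_text TEXT, label TEXT)"
)
conn.executemany(
    "INSERT INTO annotations (ad_text, label) VALUES (?, ?)",
    [("placeholder ad one", "name"), ("placeholder ad two", "age")],
)

first_batch = fetch_new_rows(conn)           # everything so far
last_id = first_batch[-1][0]

# A Turk submits another result while training is under way.
conn.execute(
    "INSERT INTO annotations (ad_text, label) VALUES (?, ?)",
    ("placeholder ad three", "reward"),
)
second_batch = fetch_new_rows(conn, last_id)  # only the new row
```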

Eric Anderson, CFA

Quantitative Analyst

Skylar Capital Energy Global Master Fund LP

(713) 341-7985 work

(281) 606-9371 cell

Runaway AI Construction

Hey Bill, here is some detail about how I am developing the runaway AI.

The AI must accomplish four tasks:

1. Annotate the individual words of the ads.

2. Group the annotations into phrases.

3. Determine number of runaways and assign phrases by runaway.

4. Clean and format phrases.

1. Annotation:

The AI iterates over every word in the ad and assigns it to one of our feature groups or to a null category. The method I chose is a non-sequential model that takes three inputs. Two of the inputs are at the word level and one is at the character level, so that the model can process names and misspellings as well as clues from commas and periods. The two word-level inputs are at different scales: the phrase scale, which looks at the word in question plus the word before and the word after it, and the context scale, which looks at the five words before and the five words after it. The character input is at the phrase scale.
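
To make the three inputs concrete, here is a rough sketch of how they could be cut out of an ad for one word. The padding token, the function names and the sample sentence are made up for illustration, and the model itself is not shown.

```python
PAD = "<pad>"  # placeholder for positions that fall outside the ad

def word_window(words, index, radius):
    """Words from index - radius to index + radius, padded at the ad's edges."""
    return [
        words[i] if 0 <= i < len(words) else PAD
        for i in range(index - radius, index + radius + 1)
    ]

def model_inputs(words, index):
    """Build the phrase-scale, context-scale and character inputs for one word."""
    phrase = word_window(words, index, 1)   # word before, the word, word after
    context = word_window(words, index, 5)  # five words on each side
    # Character input at the phrase scale; padding is left out.
    chars = list(" ".join(w for w in phrase if w != PAD))
    return phrase, context, chars

# Made-up sample text, already split so punctuation is its own token.
words = "Ran away from the subscriber , a man named Tom , aged 25 .".split()
phrase, context, chars = model_inputs(words, words.index("Tom"))
```

Keeping the comma and period as separate tokens means they show up in both the word windows and the character input.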

2. Grouping:

The grouping output will be a secondary 0 or 1 output from the annotation model. It designates whether the word after the one being examined belongs to the same group.
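
As a sketch of how the two outputs could be combined into phrases, assuming one label and one 0/1 flag per word (the function name and sample values are illustrative only):

```python
def group_phrases(words, labels, joins):
    """Merge words into (label, phrase) pairs.

    labels[i] is the feature group of word i, or None for the null category.
    joins[i] == 1 means the next word belongs to the same group as word i.
    """
    phrases = []
    current = []
    current_label = None
    for word, label, join in zip(words, labels, joins):
        if label is not None:
            if not current:
                current_label = label  # the first word sets the phrase label
            current.append(word)
        # Close the open phrase on a null word or when the flag says stop.
        if current and (label is None or join == 0):
            phrases.append((current_label, " ".join(current)))
            current = []
    if current:  # phrase still open at the end of the ad
        phrases.append((current_label, " ".join(current)))
    return phrases

words = ["named", "Tom", "about", "25", "years", "old"]
labels = [None, "name", None, "age", "age", "age"]
joins = [0, 0, 0, 1, 1, 0]
phrases = group_phrases(words, labels, joins)
```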

3. Assignment:

The number of runaways will be determined logically based on the number of unique runaway names and features mentioned in the ad. Each unassigned feature will be run through an AI model that takes as input the text of the ad, the placement of the feature in question and its position relative to the runaway names. The scores will be compared and the final assignments will be made logically. For example, one runaway cannot have two different ages, so if there are two runaways and two ages, each age goes to the runaway the AI scores as more likely and the other age goes to the other runaway.
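
One simple way to turn scores into assignments under that rule is a greedy pass, sketched below. This is only one possible approach and not necessarily what the final module will do; the scores and names are invented for the example.

```python
def assign_features(scores):
    """Greedy assignment of feature phrases to runaways.

    scores maps (feature_type, phrase, runaway) to a model score.
    The highest-scoring pairs are taken first. A pair is skipped if the phrase
    is already assigned or the runaway already has that feature type.
    """
    assigned = {}  # (runaway, feature_type) -> phrase
    used = set()   # (feature_type, phrase) pairs already given out
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for (feature_type, phrase, runaway), _score in ranked:
        if (feature_type, phrase) in used or (runaway, feature_type) in assigned:
            continue
        assigned[(runaway, feature_type)] = phrase
        used.add((feature_type, phrase))
    return assigned

# Invented scores for an ad with two runaways and two ages.
scores = {
    ("age", "25", "Tom"): 0.9,
    ("age", "25", "Jim"): 0.4,
    ("age", "30", "Tom"): 0.6,
    ("age", "30", "Jim"): 0.5,
}
assignments = assign_features(scores)
```

Here Tom gets the age 25 first, so the age 30 falls to Jim even though its score for Tom was higher.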

4. Cleaning and Formatting:

All features are extracted verbatim from the text, so formats vary from one ad to the next (for example, a reward could be ‘$10’ or ‘Ten Dollars’). I will test various AI and logical methods to clean up the features, but MTurk might be a good option here.
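
As an example of the logical route, here is a small sketch that normalizes the two reward formats mentioned above. The word list is deliberately tiny and the function name is hypothetical; anything it does not recognize comes back as None so it can be routed elsewhere, such as to MTurk.

```python
import re

# Deliberately small lookup; a real cleaner would need a fuller number parser.
NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "twenty": 20, "fifty": 50,
}

def parse_reward(text):
    """Return the reward in whole dollars, or None if the format is not recognized."""
    cleaned = text.strip().lower()
    match = re.search(r"\$\s*(\d+)", cleaned)
    if match:
        return int(match.group(1))
    # A single number word followed by "dollar(s)". The lookbehind rejects
    # hyphenated compounds such as "twenty-five", which this sketch cannot parse.
    match = re.search(r"(?<![\w-])([a-z]+)\s+dollars?\b", cleaned)
    if match:
        return NUMBER_WORDS.get(match.group(1))
    return None

rewards = [parse_reward(t) for t in ("$10", "Ten Dollars", "a handsome reward")]
```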

AI development is a mix of art and science, and the process is subject to change based on experimentation, but this is my development plan as of right now. I’m currently working on items one and two, and I’ll send you updates as I make further progress.

Let me know if you have any questions.

Thanks,

Annotation and Grouping AI

Done with the annotation and grouping AI. It’s 93% accurate and my goal is >= 95%. I’m saying ‘done’ because I think the only thing holding the accuracy back is training data. I’ve kicked off another MTurk batch to give us a little more data, which should put us at my desired accuracy level.

I’m starting work now on the assignment logic.

Assignment Logic

I’ve completed version 1 of an assignment logic module that seems to work well. I’m starting work now on the cleaning and formatting module. I’m also continuing to clean the training data and refine the annotation and grouping. The results are coming in well and the model is currently 94.1% accurate. I still believe 95% will be reached with a bit more data cleaning. I hope to have some results to share once data cleaning and formatting are complete.

Some Runaway Results

Cleaning and formatting is done for gender, name and age. With this complete, we can start looking at some results.

  • The AI processed 23,972 English ads which grouped into 8,578 different events. From these events the AI detected 9,254 unique runaways.
  • As we saw in the sample data the runaways were mostly male (83.9%).
  • Average age is 26.3 overall: 26.6 for males and 25.2 for females.
  • Here is an updated runaway chart by minimum event publication date. It now counts unique runaways per year, not just unique events or ads as we looked at before.

  • Top ten most popular female names.

  • Top ten most popular male names.

We now have a complete end-to-end process for the above features on runaway ads, with more features coming soon! I’ll continue building the cleaning and formatting functions for all of our features. I am now working on skin tone and height. I am also optimizing the model’s parameters and training data to further improve accuracy.

Let me know if you have any thoughts or questions.

Thanks,

Location Data

As discussed, I joined the newspaper table, which contains city/state columns, onto our cleaned runaway table to show a map of approximate runaway locations. The map below shows our entire dataset, spanning ~150 years. As you can see, our sample is heavily concentrated in North Carolina, with only one city each in South Carolina, Georgia, Tennessee and Louisiana, and several important states have no data points at all. I think this demonstrates our need to keep getting more ads to make the total dataset more useful.
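
The join behind the map can be sketched roughly as below. The key name newspaper_id, the row layout and the placeholder rows are assumptions for illustration; only the city/state columns come from the actual newspaper table.

```python
from collections import Counter

def runaways_per_location(runaways, newspapers):
    """Count runaways per (city, state) by joining on newspaper_id."""
    counts = Counter()
    for runaway in runaways:
        paper = newspapers.get(runaway["newspaper_id"])
        if paper is not None:  # skip rows with no matching newspaper
            counts[(paper["city"], paper["state"])] += 1
    return counts

# Placeholder rows, not real data.
newspapers = {
    1: {"city": "City A", "state": "NC"},
    2: {"city": "City B", "state": "SC"},
}
runaways = [
    {"runaway_id": 101, "newspaper_id": 1},
    {"runaway_id": 102, "newspaper_id": 1},
    {"runaway_id": 103, "newspaper_id": 2},
    {"runaway_id": 104, "newspaper_id": 9},  # no newspaper on file
]
location_counts = runaways_per_location(runaways, newspapers)
```

The locations are only approximate because they are the newspaper's city, not necessarily where the runaway was.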

On that topic, I have resumed work on ad extraction from the books, and I’m putting together an MTurk batch to speed up the French ad translation.

New Ads

I have completed scraping the following sites:

Legacy of Slavery in Maryland ~12k ads

East Texas Digital Archives ~3k ads

Virginia Geography of Slavery ~4.5k ads

The Texas and Virginia sites have images and transcripts for all ads, while the Maryland site has images and metadata but no transcripts. I will begin the process of having the Maryland ads transcribed. My plan is to insert the ads into my server and assign each one my own unique ID. Then I will send you the ads with my IDs attached, and you can reply with a mapping of my IDs to yours.
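
The ID handoff could look roughly like this. The field name partner_ad_id, the function names and the sample values are placeholders, and on the server the IDs would more likely come from an auto-increment column than from Python.

```python
def assign_ad_ids(scraped_ads, first_id=1):
    """Attach a sequential local ad_id to each scraped ad."""
    return [
        {"ad_id": ad_id, **ad}
        for ad_id, ad in enumerate(scraped_ads, start=first_id)
    ]

def attach_partner_ids(ads, id_mapping):
    """Add the returned ID to each ad; None where no mapping has come back yet."""
    return [{**ad, "partner_ad_id": id_mapping.get(ad["ad_id"])} for ad in ads]

# Placeholder ads and a placeholder mapping as it might be sent back.
scraped = [
    {"site": "site_a", "source_ad_id": "a-17"},
    {"site": "site_b", "source_ad_id": "b-03"},
]
ads = assign_ad_ids(scraped, first_id=1001)
returned_mapping = {1001: 55}
mapped_ads = attach_partner_ids(ads, returned_mapping)
```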

I have also finished Brown.pdf and will follow the same procedure of assigning an ID and asking for a return ID. I hope to have all the PDFs separated into ads within the next two weeks, hopefully sooner.

French ad translations will hopefully be finished this week.

I’ve reached out to ReadEx for access to their newspaper database, and I’ll keep bugging them.

Thanks,

Cleaned East Texas Ads

I’ve attached a CSV of the cleaned ads from the East Texas Archive website. I’m going to create a Google Drive to store the images on later, and you will be able to find them with the ‘image_path’ field. I had to remove a lot of the entries from the East Texas scraped data. About half of everything on the site is not a runaway or jailer ad, and there are also a lot of entries from stories written about encounters with slaves in Texas. While these are interesting to read, they do not give any information about runaways, so I removed them from this dataset.

Ad_id is my unique ad ID. I would like to map this to whatever [advertisement.id] you end up generating for the ads, so please send me a mapping once that is done.

Source_ad_id is the identifier from the East Texas site.

I joined my newspaper table with this one for the sake of simplicity, so you will need to pull the unique combinations of the newspaper columns and send me back a mapping to your new newspaper IDs.
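
Pulling the unique newspapers out of the joined file could be done along these lines. The newspaper column names here are guesses, so swap in whatever the headers in the attached file actually are; the sample rows are placeholders.

```python
import csv
import io

# Assumed newspaper columns; replace with the real header names in the file.
NEWSPAPER_COLUMNS = ("newspaper_name", "city", "state")

def unique_newspapers(csv_file, columns=NEWSPAPER_COLUMNS):
    """Return distinct newspaper rows in first-seen order."""
    seen = {}
    for row in csv.DictReader(csv_file):
        seen.setdefault(tuple(row[column] for column in columns), None)
    return list(seen)

# Placeholder contents standing in for the attached file.
sample = io.StringIO(
    "ad_id,source_ad_id,newspaper_name,city,state\n"
    "1,tx-001,Paper One,City A,TX\n"
    "2,tx-002,Paper One,City A,TX\n"
    "3,tx-003,Paper Two,City B,TX\n"
)
papers = unique_newspapers(sample)
```

Each tuple in the result would then get one new newspaper ID in the mapping.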

Let me know if you have any questions.

Thanks,

Newspaper Map

Added the East Texas Ads.
