Runaway Update

Hey Bill, here is a quick update on the Runaway project:

-The Turks have finished transcribing about 7,000 ads. I’ve broken the ads up into batches of 1,000 ads (2,000 jobs) and post them one at a time, because I’ve discovered that if I post too many at once the Turks will do all the easy transcriptions and skip the difficult ones, leaving the batches only ~75% done. If they keep this pace up, I expect we should have all 14,000 ads transcribed sometime in the middle of next week.

-I am pushing the ads through our data cleaning process as the batches complete. So far I’ve found no problems, but I’ll continue doing spot checks just to be sure. I designed the system to be very forgiving if we need to adjust which transcript is the best or how we are grouping ads into events.

-After some research this week into various text feature extraction projects, I’ve started building a POC model which will hopefully act as the framework for every feature extraction task we want to run. I’m doing my initial test on rewards: I want the AI to read the entire ad and tell me all the different rewards offered. This is harder than it sounds because one ad usually offers multiple rewards and mentions the same reward multiple times. If the AI is successful, it will tell me all the unique rewards and where in the text they are mentioned. The same kind of problem will need to be solved for every feature extraction: one runaway could be mentioned multiple times, or several different runaways could each be mentioned once, and a successful AI will know the difference.
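To make the target output of the reward test concrete, here is a minimal Python sketch. It is not the AI model: a simple regular expression stands in for it, and the ad text, the `find_reward_mentions` name and the dollar-amount pattern are all illustrative assumptions. It only shows the shape of the result, which is each unique reward together with every place in the text where it is mentioned.

```python
import re
from collections import defaultdict

# Assumed pattern: rewards written as "$20" or "20 dollars". Real ads vary far more.
REWARD_PATTERN = re.compile(r"\$\s?(\d+)|(\d+)\s+dollars", re.IGNORECASE)

def find_reward_mentions(ad_text):
    """Map each unique reward amount to the (start, end) spans where it is mentioned."""
    mentions = defaultdict(list)
    for match in REWARD_PATTERN.finditer(ad_text):
        amount = int(match.group(1) or match.group(2))
        mentions[amount].append(match.span())
    return dict(mentions)

# Made-up ad text, for illustration only.
sample_ad = (
    "$20 reward for his return. The above 20 dollars will be paid "
    "on delivery, or $50 if taken out of the state."
)
rewards = find_reward_mentions(sample_ad)

for amount, spans in sorted(rewards.items()):
    print(amount, [sample_ad[start:end] for start, end in spans])
```

Here the two mentions of the 20 dollar reward collapse into one unique reward with two text positions, while the 50 dollar reward is kept separate.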

I’ll continue to touch base as progress is made.

Runaway Update, AI POC Model for Feature Extraction

The Turks have completed 10,000 ad transcriptions. I have added these to our database and assigned them event IDs. I am hopeful that we will be 100% done with the remaining ads by tomorrow.

I have also been working on an AI model for feature extraction. For the first version of the model I picked the features Reward, Name, Age and Skin Tone. The job of the AI is to pull out the parts of the text that match those categories so that the data can be analyzed further (for example, the skin tone extractions will need to be run through a secondary model that groups them into the three categories we discussed).

I’m pretty happy with the results for a first attempt. Here are some examples:

You can see the model made one mistake (Name: faced) but got everything else right, including the complicated ad listing multiple runaways, and this is with a very small amount of training data. I’m pretty confident that with some more training data and a little more hyperparameter tuning the model’s accuracy will rise to usable levels. That said, I think we will probably want an MTurk review of whatever results we come up with; it will still be much cheaper, faster and more accurate than a purely MTurk solution.

I think we need to decide exactly what our features are going to be so I can MTurk some more training data. I can always add more features later but the training set will have to be updated to include whatever we add. Here is a list of features I think we should include, please let me know what to add/change:

-Name

-Age

-Skin Tone (three categories: light, average, dark)

-Reward Offered

-Secondary rewards offered

-Time of year Runaway event occurred

-Runaway location origin

Thanks,

Eric Anderson, CFA

Quantitative Analyst

Skylar Capital Energy Global Master Fund LP

(713) 341-7985 work

(281) 606-9371 cell

11,743 Transcriptions

These represent ~1,174 hours of work from 398 different MTurk workers. We did two transcriptions per ad, so the gross total is double that number of hours. You are welcome to use these however you want, but I might still make some further changes to the transcripts (if I find some I’m not happy with, I might select a different final version or make other corrections).
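As a quick check on these figures, the arithmetic can be worked through in a few lines of Python. This assumes the ~1,174 hours covers one transcription of each of the 11,743 ads, which is how the doubling for the gross total reads; the variable names are illustrative.

```python
# Figures taken from the message above.
transcriptions = 11_743
net_hours = 1_174        # approximate hours for one transcription per ad (assumed reading)
transcripts_per_ad = 2   # each ad was transcribed twice

minutes_per_transcription = net_hours * 60 / transcriptions
gross_hours = net_hours * transcripts_per_ad

print(f"{minutes_per_transcription:.1f} minutes per transcription")  # 6.0 minutes per transcription
print(f"{gross_hours} gross hours")                                  # 2348 gross hours
```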

Right now I’m working on feature extraction. I’m starting with an AI extraction, but I’ll probably feed the results of that into MTurk for final quality control. It’s going to take me some time to complete the model, but my proof of concept version is so far giving me decent results for a limited number of features:

Feature extraction will occur at the ad level, and we will then use the most popular response per event group as the final features. That way we can reclassify ads to different events (if needed) without having to redo any feature extraction, and see the downstream results right away.
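The "most popular response per event group" step can be sketched roughly as follows. This is a minimal illustration, not the production code: the `features_by_event` name, the dictionary layout and the sample values are all assumptions, and ties are simply resolved in favor of the value seen first.

```python
from collections import Counter, defaultdict

def features_by_event(ad_features, ad_to_event):
    """Pick the most common value of each feature across the ads in an event."""
    votes = defaultdict(lambda: defaultdict(Counter))
    for ad_id, features in ad_features.items():
        event_id = ad_to_event[ad_id]
        for name, value in features.items():
            votes[event_id][name][value] += 1
    return {
        event_id: {name: counts.most_common(1)[0][0] for name, counts in by_feature.items()}
        for event_id, by_feature in votes.items()
    }

# Hypothetical ad-level extractions (values invented for illustration).
ad_features = {
    "ad1": {"name": "Tom", "age": "25"},
    "ad2": {"name": "Tom", "age": "26"},
    "ad3": {"name": "Tom", "age": "25"},
}
ad_to_event = {"ad1": "e1", "ad2": "e1", "ad3": "e1"}
event_features = features_by_event(ad_features, ad_to_event)

# Reassigning an ad to a different event only changes the mapping;
# the ad-level extractions are reused exactly as they are.
ad_to_event["ad2"] = "e2"
regrouped = features_by_event(ad_features, ad_to_event)
```

Because the extraction results are stored per ad, regrouping is just a second call with an updated ad-to-event mapping.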

The features I’m going to extract are:

-Name

-Age

-Sex

-Skin tone/complexion (three categories: light, average, dark)

-Height

-Weight

-Eye color

-Scars or identifying marks

-Language spoken

-Reward offered

-Secondary rewards offered

-Time of year runaway event occurred

-Runaway location origin

-Occupation or skill set

-Enslaver name

Let me know if you want me to add anything. I’ll begin work on splitting up the new ads you sent me after I’ve made some more progress on feature extraction. I’ll touch base as I have new results to share.

Thanks,

Eric Anderson, CFA

Quantitative Analyst

Skylar Capital Energy Global Master Fund LP

(713) 341-7985 work

(281) 606-9371 cell

11,743 Transcriptions

Wow! Quick question, are these “new,” are they corrected versions of transcripts already crowdsourced into the DB, or a mix of both?

Thanks,
Ed

PS: Bill and Eric, if you are up for another call, let me know. I have a favor to ask, based on your deep engagement and ideas for improvement of the database.

Edward E. Baptist

Professor, Department of History

450 McGraw Hall

Cornell University

Ithaca, NY 14853 USA

Freedomonthemove.org

From: Eric Anderson, CFA
Sent: Wednesday, April 29, 2020 1:27 PM
To: Edward Eugene Baptist; Subject: 11,743 Transcriptions

11,743 Transcriptions

These are all the English ads of sufficient resolution which did not already have a transcript.

From: Edward Eugene Baptist <eeb36@cornell.edu>
Sent: Wednesday, April 29, 2020 12:30 PM
To: Eric Anderson, CFA <eanderson@skylarcap.com>; Brandon T. Kowalski <kowalski@cornell.edu>
Cc: Bill Perkins <BPerkins@skylarcap.com>;
Subject: Re: [External] 11,743 Transcriptions

Wow! Quick question, are these “new,” are they corrected versions of transcripts already crowdsourced into the DB, or a mix of both?

Thanks,
Ed

PS: Bill and Eric, if you are up for another call, let me know. I have a favor to ask, based on your deep engagement and ideas for improvement of the database.

Edward E. Baptist

Professor, Department of History

450 McGraw Hall

Cornell University

Ithaca, NY 14853 USA

Freedomonthemove.org

11,743 Transcriptions

I can be available right now; Eric can call my cell and conference us in, or I’m available at 4pm EST. Eric, try to conference us: +17133986024

Sent from my Samsung Galaxy smartphone.

11,743 Transcriptions

We also already error-corrected the earlier ones using MTurk.

Sent from my Samsung Galaxy smartphone.

11,743 Transcriptions

Now is good, if everyone is available. 607-379-9049

Edward E. Baptist

Professor, Department of History

450 McGraw Hall

Cornell University

Ithaca, NY 14853 USA

Freedomonthemove.org

From: Bill Perkins <BPerkins@skylarcap.com>
Sent: Wednesday, April 29, 2020 1:33:32 PM
To: Eric Anderson, CFA <eanderson@skylarcap.com>; Edward Eugene Baptist <eeb36@cornell.edu>; Brandon T. Kowalski <kowalski@cornell.edu>
Cc: <>
Subject: RE: 11,743 Transcriptions

I can be available right now; Eric can call my cell and conference us in, or I’m available at 4pm EST. Eric, try to conference us: +17133986024

Sent from my Samsung Galaxy smartphone.

Runaway Update, Feature Extraction App

I’m working on extracting the features from the ads. As discussed, I would like to try with an AI first and then either have the Turks go over the results or, if the AI is not accurate enough, just have the Turks do the entire extraction.

To accomplish either the AI or the Turk extraction, we will need a survey the Turks can use that will both create AI training data through text annotation and complete the feature extraction for the final result. I’m creating a web app that will let us do both. The Turks will use this app to annotate the text of the runaway ads in a way that makes it useful as AI training data, and once the ad text is annotated, each grouped annotation automatically becomes an available option in the dropdown inputs for a dynamic number of runaways.

This web app will be one tool that lets us do what we want no matter which direction we go. I’m about 60% done with its development. Below is a video of a POC version of the app I built; obviously there will be many more feature categories in the final version.
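A rough sketch of how annotations could feed the dropdowns is shown below. The web app's actual data model is not described here, so the `Annotation` class, the `dropdown_options` function and the sample ad text are hypothetical; the point is only that each highlighted span, grouped by category, becomes a selectable option.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    label: str   # feature category, e.g. "Name" or "Age"
    start: int   # character offset where the highlighted text begins
    end: int     # character offset just past the highlighted text

def dropdown_options(ad_text, annotations):
    """Collect the unique annotated strings per category, in order of first appearance."""
    options = {}
    for ann in sorted(annotations, key=lambda a: a.start):
        text = ad_text[ann.start:ann.end]
        choices = options.setdefault(ann.label, [])
        if text not in choices:
            choices.append(text)
    return options

# Made-up ad text, for illustration only.
ad_text = "Ran away, Tom, aged 25, and Ben, aged 30. Tom is tall."

def mark(label, phrase, search_from=0):
    """Helper that builds an annotation by locating a phrase in the ad text."""
    start = ad_text.index(phrase, search_from)
    return Annotation(label, start, start + len(phrase))

annotations = [
    mark("Name", "Tom"), mark("Age", "25"),
    mark("Name", "Ben"), mark("Age", "30"),
    mark("Name", "Tom", 20),  # second mention of the same name
]
options = dropdown_options(ad_text, annotations)
```

With these inputs the "Name" dropdown would offer two choices even though three name spans were highlighted, and a form could then show one set of dropdowns per runaway.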

Eric Anderson, CFA

Quantitative Analyst

Skylar Capital Energy Global Master Fund LP

(713) 341-7985 work

(281) 606-9371 cell

Latest on FOTM

Almost done with the web app. Once it is complete, I’ll use it to have the Turks generate training data on 1,000 ads or so. From there I’ll work on a final version of the feature extraction AI. If the results are good, I’ll apply it to the entire dataset; if not, I’ll have the Turks do all the feature extraction.

I think the web app gives us a really powerful tool for understanding the data while getting features out. Because the Turks are required to tag every feature back to a section of the ad transcript, it creates a reference that goes all the way back to the source, which we can use later if we have questions about the results we are getting. This is different from the FOTM method, which is just simple fill-in-the-blank input. The text annotation method is necessary for AI training, but the added benefit will be a much more connected dataset.
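One way the "are the results good" decision could be made is to compare the AI's spans against the Turk annotations on held-out ads. The sketch below is an assumption about how that check might look, not a description of the actual evaluation: it scores exact span matches, and the `MIN_F1` cutoff and sample spans are invented for illustration.

```python
def span_scores(gold, predicted):
    """Precision, recall and F1 for exact (label, start, end) matches."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if true_positives else 0.0
    return precision, recall, f1

# Hypothetical spans for one ad: (label, start, end).
gold = {("Name", 10, 13), ("Age", 20, 22), ("Reward", 40, 43)}
predicted = {("Name", 10, 13), ("Age", 20, 22), ("Name", 50, 55)}

precision, recall, f1 = span_scores(gold, predicted)

MIN_F1 = 0.9  # hypothetical cutoff; the real threshold would be a judgment call
use_ai_for_full_dataset = f1 >= MIN_F1
```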
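The traceability described above can be illustrated with a small Python sketch. The record layout and the `show_source` helper are hypothetical; the idea is simply that a feature stored with its character offsets can always be shown in the context of the original transcript.

```python
def show_source(transcript, start, end, context=15):
    """Return the tagged text in brackets with a little surrounding context."""
    left = max(0, start - context)
    right = min(len(transcript), end + context)
    return f"{transcript[left:start]}[{transcript[start:end]}]{transcript[end:right]}"

# Made-up transcript and feature record, for illustration only.
transcript = "Ran away from the subscriber, a man named Tom, about 25 years of age."
feature = {"label": "Age", "value": "25"}
feature["start"] = transcript.index(feature["value"])
feature["end"] = feature["start"] + len(feature["value"])

snippet = show_source(transcript, feature["start"], feature["end"])
print(snippet)
```

A plain fill-in-the-blank value has no offsets, so there would be nothing to look up when a result seems questionable.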

Eric Anderson, CFA

Quantitative Analyst

Skylar Capital Energy Global Master Fund LP

(713) 341-7985 work

(281) 606-9371 cell