Field Meta

Field meta is extra descriptive information about a field, for example:

{
name: "s7", // field name (string)
type: "Numeric", // {Numeric, String, DateTime, Boolean, Object, ResultCell, Unknown}
label: "US Region",
valueLabels: [
{ val: 1, label: "Northeast" },
{ val: 2, label: "Midwest" },
{ val: 3, label: "South" },
{ val: 4, label: "West" },
]
}

  • name: Field name
  • type: Data type {Numeric, String, DateTime, Boolean, Object, ResultCell, Unknown}
  • label: A description of the field
  • valueLabels: Descriptions of values that the field may contain
  • fields: sub-fields (if type is Object)

The SPSS file format (*.sav) stores field meta. So, importing an SPSS file as a data source means that field meta may exist. Other file formats such as CSV or Excel don't have field meta and you may need to add it manually.

Stages attempt to preserve field meta wherever possible. See the doc for each Stage to learn if/how each stage modifies field meta.

Field meta can be used by output visualizations, e.g., as column headers, legend labels, etc.
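
As a rough illustration of how a visualization might consume field meta, the sketch below looks up a column header and legend labels for the example field shown earlier. The helper name labelForValue is hypothetical, and falling back to the raw value when no value label exists is an assumption, not documented behavior.

```javascript
// Hypothetical helper: return the display label for a raw value,
// falling back to the value itself when no value label is defined.
function labelForValue(fieldMeta, val) {
  const entry = (fieldMeta.valueLabels || []).find((vl) => vl.val === val);
  return entry ? entry.label : String(val);
}

const region = {
  name: "s7",
  type: "Numeric",
  label: "US Region",
  valueLabels: [
    { val: 1, label: "Northeast" },
    { val: 2, label: "Midwest" },
    { val: 3, label: "South" },
    { val: 4, label: "West" },
  ],
};

// Column header: prefer the label, fall back to the field name.
const header = region.label || region.name;

// Legend labels for the raw values found in the data.
const legend = [3, 1, 9].map((v) => labelForValue(region, v));
```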

Name (.name)

A unique identifier string of any length.

Label (.label)

A string description of the field.

Data Type (.type)

One of the following values:

  • Numeric (number) - an integer or float

  • String (string)

  • DateTime (date)

  • Boolean (bool)

  • ValueCell (value-cell) - an object with { .value (default), .n, (.freq), (.wn), (.uwn) }

  • LabelCell (label-cell) - an object with { .label (default), (.syntax), (.name), (.valueFormat) }

  • Object

  • Unknown

Note for Rob:

  • When inspector encounters a value-cell, it will ingest .value
  • When inspector encounters a label-cell, it will ingest .label
  • Also, when data writer writes a value-cell or label-cell, it will write .value or .label respectively and ignore the other properties. (Maybe it could provide the option to break those out into separate fields, or maybe author would need to break them out if they really want them). Actually yeah the .n is most useful for low-n indication in output visualizations, and/or filtering out data cells with low n. And the .syntax is mostly for debugging. So I think they can be ignored when writing out a data file - unless the author decides to pull them out using an addFields stage.
  • In syntax, references to an object without the dot specifier will refer to the default property (.value or .label). For example: sort by q1 means sort by q1.value if q1 is a value-cell.
// sort stage example
["q1"]

// sort stage example (same as above)
["q1.value"]

// add fields example
{
"gap": "seg2 - seg1"
}

// add fields example (same as above)
{
"gap": "seg2.value - seg1.value"
}

// example filtering on n
["q1.n > 50"]
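
The default-property rule above can be sketched as a small resolver. This is only an illustration: the function name resolveReference, the fieldTypes map, and keying the defaults on the type names "ValueCell" and "LabelCell" are all assumptions.

```javascript
// Assumed mapping from cell type to its default property.
const DEFAULT_PROPERTY = { ValueCell: "value", LabelCell: "label" };

// Hypothetical resolver: "q1" means q1.value for a value-cell,
// "q1.n" always means the named property.
function resolveReference(row, fieldTypes, ref) {
  const [name, prop] = ref.split(".");
  const cell = row[name];
  if (prop !== undefined) {
    return cell == null ? undefined : cell[prop];
  }
  const defaultProp = DEFAULT_PROPERTY[fieldTypes[name]];
  return defaultProp && cell != null ? cell[defaultProp] : cell;
}

const row = { q1: { value: 0.45, n: 100 }, age: 30 };
const fieldTypes = { q1: "ValueCell", age: "Numeric" };

const a = resolveReference(row, fieldTypes, "q1");       // same as q1.value
const b = resolveReference(row, fieldTypes, "q1.value");
const n = resolveReference(row, fieldTypes, "q1.n");
```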

note

I need to figure out how to handle returning weighted vs. unweighted n. Is .n always unweighted and a weighted n is .wn?

Actually I should also consider the way I return .value (weighted vs unweighted). It seems like .value should be weighted (if any), otherwise unweighted.

Maybe this is a user preference setting.??

// results from an unweighted calc
{
value: 0.45,
freq: 45,
n: 100
}

// results from a weighted calc
{
"value": 0.42, // weighted
"freq": 41.3, // weighted
"n": 98.3, // weighted (maybe a user preference dictates which is priority: weighted or unweighted)

// depending on the above priority, only one of the following would return
"uwN": 100, // unweighted N (uN or uwN?)
"wN": 98.3, // weighted N

// this is probably too much to return (wasteful)
"unweighted": {
"value": 0.45,
"freq": 45,
"n": 100 // PEOPLE NEED THIS THOUGH
},


}

about valueFormat

The results pane, when rendering a numerical value (type number, valueCell, or maybe date), will search for a valueFormat:

  • First, it will check its column definition for a valueFormat.
  • Second, it will check its row for a labelCell containing a valueFormat. If multiple exist, the leaf cell (latest occurrence) takes priority.
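
The lookup order above might be sketched as follows. The function name findValueFormat is hypothetical, the row's label cells are assumed to be passed in order from outermost to leaf, and the format strings are placeholders rather than a documented format syntax.

```javascript
// Hypothetical lookup: the column definition wins; otherwise use the last
// (leaf-most) label cell in the row that carries a valueFormat.
function findValueFormat(columnDef, rowLabelCells) {
  if (columnDef && columnDef.valueFormat != null) {
    return columnDef.valueFormat;
  }
  for (let i = rowLabelCells.length - 1; i >= 0; i--) {
    const cell = rowLabelCells[i];
    if (cell && cell.valueFormat != null) {
      return cell.valueFormat;
    }
  }
  return undefined; // no format found; the caller decides on a default
}

const labelCells = [
  { label: "Satisfaction", valueFormat: "0.0" },
  { label: "Top 2 box", valueFormat: "0%" }, // leaf cell
];

const fromColumn = findValueFormat({ valueFormat: "0.00" }, labelCells);
const fromRow = findValueFormat({}, labelCells);
```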

Value Labels (.valueLabels)

An array of { val, label } pairs describing values that the field may contain.

Child Fields (.fields)

Array of child fields of type Field Meta. These may exist if Data Type is Object. I might not have this though. Would be compatible with MongoDB, but could get overly complex. Oh, what about nested groups though!?!? How in the world are those returned?

Define Raw

Field meta may be defined as follows:

// val can be an int (can it be a double? i don't think so)
{
label: "US Region",
valueLabels: [
{ val: 1, label: "Northeast" },
{ val: 2, label: "Midwest" },
{ val: 3, label: "South" },
{ val: 4, label: "West" },
]
}

Define Raw

// val can be a string
// I MIGHT NOT ALLOW THIS. SURE SPSS DOES, BUT MAYBE I WON'T.
// Spss only allows a few chars for val, which doesn't seem amazingly helpful.
// But it could be helpful.
// Allowing strings might create confusion if merging a set of valuelabels with different types (string vs. int)
// I guess in those instances the conflict resolver would need to pick just one.
// ... because I should probably force valueLabels to be EITHER ints or strings, but not both.
// Maybe the SPSS reader could replace the chars with ints, but then how can we ensure
// each file would use the same indexing? Not a good idea.
{
label: "US State",
valueLabels: [
{ val: "AL", label: "Alabama" },
{ val: "AK", label: "Alaska" },
{ val: "AZ", label: "Arizona" },
{ val: "AR", label: "Arkansas" },
...
{ val: "WY", label: "Wyoming" },
]
}
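
If valueLabels are forced to be either all ints or all strings, as the comments above suggest, a check along these lines could flag mixed sets before a merge. This is a sketch only. The function name valueLabelKind and its return values are assumptions.

```javascript
// Hypothetical check: classify the val types used in a valueLabels array.
// Returns "int", "string", "other", "empty", or "mixed".
function valueLabelKind(valueLabels) {
  const kinds = new Set(
    valueLabels.map((vl) => {
      if (typeof vl.val === "string") return "string";
      if (Number.isInteger(vl.val)) return "int";
      return "other"; // e.g. a double, which may not be allowed
    })
  );
  if (kinds.size === 0) return "empty";
  if (kinds.size === 1) return [...kinds][0];
  return "mixed";
}

const regionKind = valueLabelKind([
  { val: 1, label: "Northeast" },
  { val: 2, label: "Midwest" },
]);

const mixedKind = valueLabelKind([
  { val: 1, label: "Northeast" },
  { val: "AL", label: "Alabama" },
]);
```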

Load from another field

Actually this is just a placeholder, because loading from another field depends on the context. The possible contexts are:

  • When using addFields or Select stage in pipeline (see below)
  • When stacking files in a data flow (see below)
{
$fromField: "Q1"
}

Example in Select Stage

{
$selectFields: [
{ name: "Q1" }, // this will automatically inherit meta from Q1
{ name: "mm", syntax: "month" }, // syntax is smart enough to pull meta from month
{ name: "S3", syntax: "wave < 5 ? null : S3" }, // syntax evaluator pulls meta from S3
{
name: "Q1_rebased",
syntax: "ifnull(Q1,0)",

// this syntax could figure out it should still probably inherit meta from Q1, but should it?
// unless it's something like ifnull(Q1,Q2) -- then it wouldn't know the meta
// i think if syntax is provided, and the syntax isn't a columnExpression,


// do we ask for this?
// it would suck if you have a large bank of vars to rebase

metaFrom: "Q1" // pulls label and valuelabels from Q1

// or:
label: { $fromField: "Q1" }, // pulls label from Q1
valueLabels: { $fromField: "Q1" } // pulls valuelabels from Q1
},
{
name: "Q1_rollup",
syntax: "Q1 in (3,4,5) ? 3 : Q1", // this syntax doesn't know how to pull meta

// meta goes here:
label: "Recoded question about something",
valueLabels: [
{ val: 1, label: "Item one" },
{ val: 2, label: "Item two" },
{ val: 3, label: "Item three, four or five" }
]
}

]
}
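
One possible precedence for the options sketched above: explicit label / valueLabels first (including $fromField), then metaFrom, then automatic inheritance from the same-named field when no syntax is given. The function resolveMeta is hypothetical, and it deliberately does not try to infer meta from a syntax expression, since that behavior is still an open question above.

```javascript
// Resolve a { $fromField: "Q1" } reference, or return the value unchanged.
function resolveFromField(value, sourceFields, key) {
  if (value && typeof value === "object" && !Array.isArray(value) && "$fromField" in value) {
    const src = sourceFields[value.$fromField];
    return src ? src[key] : undefined;
  }
  return value;
}

// Hypothetical resolver for one $selectFields entry.
function resolveMeta(spec, sourceFields) {
  // Fallback source: metaFrom if given, else the same-named field when
  // there is no syntax (plain pass-through).
  let fallbackName;
  if (spec.metaFrom !== undefined) fallbackName = spec.metaFrom;
  else if (spec.syntax === undefined) fallbackName = spec.name;
  const fallback = (fallbackName !== undefined && sourceFields[fallbackName]) || {};

  const meta = { name: spec.name };
  for (const key of ["label", "valueLabels"]) {
    const explicit = resolveFromField(spec[key], sourceFields, key);
    const value = explicit !== undefined ? explicit : fallback[key];
    if (value !== undefined) meta[key] = value;
  }
  return meta;
}

const sourceFields = {
  Q1: { label: "Question about something", valueLabels: [{ val: 1, label: "Item one" }] },
};

const passthrough = resolveMeta({ name: "Q1" }, sourceFields);
const rebased = resolveMeta({ name: "Q1_rebased", syntax: "ifnull(Q1,0)", metaFrom: "Q1" }, sourceFields);
const rollup = resolveMeta({ name: "Q1_rollup", syntax: "Q1 in (3,4,5) ? 3 : Q1", label: "Recoded question about something" }, sourceFields);
```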

Stacking files (in data flow)

When files from different waves are stacked, the data flow should check the meta and warn about discrepancies. It won't warn about every difference. It will only warn if there are conflicts.

A conflict is:

  • A field label is different
  • A value label is different (same value but different label)

A noted discrepancy (not a conflict) is:

  • A new or dropped field
  • A new value label on an existing field
  • A dropped value label on a propagated field
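
The distinction between conflicts and noted discrepancies could be sketched like this. The function name compareMeta and the shape of the returned items are assumptions. The sketch also treats a label that is missing on one side as "different", which may or may not be the intended rule.

```javascript
// Hypothetical comparison of field meta from two stacked files.
function compareMeta(fields1, fields2) {
  const conflicts = [];
  const discrepancies = [];
  const byName1 = new Map(fields1.map((f) => [f.name, f]));
  const byName2 = new Map(fields2.map((f) => [f.name, f]));
  const toMap = (f) => new Map((f.valueLabels || []).map((vl) => [String(vl.val), vl.label]));

  for (const [name, f1] of byName1) {
    const f2 = byName2.get(name);
    if (!f2) {
      discrepancies.push({ field: name, kind: "droppedField" });
      continue;
    }
    if (f1.label !== f2.label) conflicts.push({ field: name, kind: "label" });

    const labels1 = toMap(f1);
    const labels2 = toMap(f2);
    for (const [val, label] of labels1) {
      if (!labels2.has(val)) {
        discrepancies.push({ field: name, kind: "droppedValueLabel", val });
      } else if (labels2.get(val) !== label) {
        conflicts.push({ field: name, kind: "valueLabel", val }); // same value, different label
      }
    }
    for (const val of labels2.keys()) {
      if (!labels1.has(val)) discrepancies.push({ field: name, kind: "newValueLabel", val });
    }
  }
  for (const name of byName2.keys()) {
    if (!byName1.has(name)) discrepancies.push({ field: name, kind: "newField" });
  }
  return { conflicts, discrepancies };
}

const wave1 = [
  { name: "Q1", label: "Satisfaction", valueLabels: [{ val: 1, label: "Yes" }, { val: 2, label: "No" }] },
  { name: "Q2", label: "Age" },
];
const wave2 = [
  { name: "Q1", label: "Overall satisfaction", valueLabels: [{ val: 1, label: "Yep" }, { val: 3, label: "Maybe" }] },
  { name: "Q3", label: "Region" },
];

const report = compareMeta(wave1, wave2);
```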

Should we also perform an inspection and notify about discrepancies?

  • Maybe if a user wants to manually inspect

How are conflicts resolved?

  • When a field label is different:
    • Pick which label to use. The stack stage needs a conflictResolution section which specifies the choice.
  • When a value label is different (same value but different label)
    • User could pick which label to use (in a conflictResolution spec)
    • If the meaning is different, user should create a recode in advance of the stack stage
{
$stack: {
ds1: "<datasource 1 id>",
ds2: "<datasource 2 id>",

// metaPick is OPTIONAL
// It tells the stacker which meta to use.
// If not provided, the stacker will MERGE meta from both datasets and stop only if there are conflicts
// It is used for resolution of conflicts (if any)

// take meta from second dataset
metaPick: 2 // use second data source for all meta

// or:

// object layout concept:
metaPick: {
Q1: 1, // take label and valueLabels from first dataset
Q3: {
label: 2 // take label from second dataset
// note: valueLabels will be auto-merged
},
Q4: {
// note: label not affected
valueLabels: 2 // take all valueLabels from second dataset
},
Q5: {
valueLabels: {
"3": 2 // take value label for val 3 from second dataset
}
},
Q10: {
label: { $merge: "{1} / {2}" } // concatenate the two labels
},
Q11: {
valueLabels: {
"3": { $merge: "pre 2022: {1}; 2022+: {2}" } // merge the labels
}
}
},

// array layout concept:
metaMerge: [ // OPTIONAL!

// for Q1, take meta from first dataset
{ $pickField: { field: "Q1", from: 1 } },

// for Q2, use the label from first dataset
{ $pickLabel: { field: "Q2", from: 1 } },

// for Q2, concatenate the labels
{ $pickLabel: { field: "Q2", concat: "{1} / {2}" } },

// for Q3's value 7, use the valuelabel from the second dataset
{ $pickValueLabel: { field: "Q3", val: "7", from: 2 } },


// should there exist a fallback?
// i.e. "for all other conflicts, use ..."
// no. because it's not conflict only.

// regardless of conflicts, take meta from second dataset
// (this should be the only metaFlow statement as it would trump all others)
{ $pickAll: { use: 2 } } // NOT USING THIS


]
}
}
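
To make the object layout concept concrete, here is a sketch of how a single label pick might be applied, including the $merge template with {1} and {2} placeholders. The function name pickLabel is hypothetical, and throwing on an unresolved conflict is one reading of "stop only if there are conflicts".

```javascript
// Hypothetical resolver for one label (field label or value label).
// pick is 1, 2, { $merge: "<template>" }, or undefined.
function pickLabel(pick, label1, label2) {
  if (pick === 1) return label1;
  if (pick === 2) return label2;
  if (pick && typeof pick === "object" && typeof pick.$merge === "string") {
    // A replacer function avoids special "$" handling in replacement strings.
    return pick.$merge.replace(/\{([12])\}/g, (_, n) => (n === "1" ? label1 : label2));
  }
  if (label1 === label2) return label1; // nothing to resolve
  throw new Error(`Unresolved label conflict: "${label1}" vs "${label2}"`);
}

const fromSecond = pickLabel(2, "Old wording", "New wording");
const concatenated = pickLabel({ $merge: "{1} / {2}" }, "Old wording", "New wording");
const dated = pickLabel({ $merge: "pre 2022: {1}; 2022+: {2}" }, "Item", "Thing");
```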


Pipeline Output

A pipeline returns: data, fields, and maybe nestedFields.

Perhaps the fields array (and arrays within nestedFields) could be returned as dicts if requested, in case the author doesn't care about ordering.

fields is either an array or an object with an .entries array. If object, it can contain miscellaneous metadata for that dimension??

nestedFields is a dict, but each dict entry is either an array or an object with an .entries array. If object, it could contain arbitrary metadata for display purposes.

Each entry item can contain arbitrary metadata.

{
fields: [
{ name: "Q1", label: "Question about something", valueLabels: {...} },
{ name: "Q1_rebased", label: "Question about something", valueLabels: {...} },
{ name: "Q1_rollup", label: "Recoded question about something", valueLabels: {...} },
],
data: [
// rows here
]
}
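
Since fields may arrive either as an array or as an object with an .entries array, a consumer might normalize it first. The sketch below also shows the dict form mentioned above for authors who don't care about ordering. Both function names are hypothetical.

```javascript
// Accept either layout and return the plain array of field entries.
function fieldEntries(fields) {
  return Array.isArray(fields) ? fields : fields.entries;
}

// Convert the field list into a dict keyed by field name (ordering is lost).
function fieldsToDict(fields) {
  return Object.fromEntries(fieldEntries(fields).map((f) => [f.name, f]));
}

const asArray = [
  { name: "Q1", label: "Question about something" },
  { name: "Q1_rollup", label: "Recoded question about something" },
];
const asObject = { entries: asArray, title: "Columns" }; // extra metadata is an assumption

const dict = fieldsToDict(asObject);
```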

Pipeline Output Nested Fields (from Aggregation Stage)

{
fields: [ // top level fields
{ name: "label", label: "Some label" },

// from dim1
{ name: "m1", label: "2020 Jan", category: "Month", nest: "segDim" }, // ??
{ name: "m2", label: "2020 Feb", category: "Month", nest: "segDim" }, // ??
{ name: "m3", label: "2020 Mar", category: "Month", nest: "segDim" }, // ??
],
nestedFields: {
segDim: [
// from dim2
{ name: "seg1", label: "Segment 1", category: "Segment", type: "cell" }, // cell has val, n, maybe freq?
{ name: "seg2", label: "Segment 2", category: "Segment", type: "cell" },
],
},
data: [
// first row
{
label: "Product A1000",
m1: {
seg1: { val: 12.34, n: 1000 },
seg2: { val: 12.34, n: 1000 },
},
m2: {
seg1: { val: 12.34, n: 1000 },
seg2: { val: 12.34, n: 1000 },
},
m3: {
seg1: { val: 12.34, n: 1000 },
seg2: { val: 12.34, n: 1000 },
},
},
// next rows go here
]
}
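
As a sketch of how a consumer might walk this nested output, the function below flattens one row into dotted keys using fields and nestedFields. The name flattenRow and the dotted-key convention are assumptions, and only one level of nesting is handled.

```javascript
// Hypothetical flattening of one row: fields with a "nest" property are
// expanded using the matching nestedFields entry.
function flattenRow(row, fields, nestedFields) {
  const flat = {};
  for (const field of fields) {
    const value = row[field.name];
    const inner = field.nest ? nestedFields[field.nest] : undefined;
    if (!inner) {
      flat[field.name] = value;
      continue;
    }
    for (const sub of inner) {
      flat[`${field.name}.${sub.name}`] = value ? value[sub.name] : undefined;
    }
  }
  return flat;
}

const fields = [
  { name: "label", label: "Some label" },
  { name: "m1", label: "2020 Jan", category: "Month", nest: "segDim" },
];
const nestedFields = {
  segDim: [
    { name: "seg1", label: "Segment 1", category: "Segment", type: "cell" },
    { name: "seg2", label: "Segment 2", category: "Segment", type: "cell" },
  ],
};
const row = {
  label: "Product A1000",
  m1: { seg1: { val: 12.34, n: 1000 }, seg2: { val: 12.34, n: 1000 } },
};

const flat = flattenRow(row, fields, nestedFields);
```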