Field Meta
Field meta is extra descriptive information attached to a field, for example:
{
name: "s7", // field name (string)
type: "Numeric", // {Numeric, String, DateTime, Boolean, Object, ResultCell, Unknown}
label: "US Region",
valueLabels: [
{ val: 1, label: "Northeast" },
{ val: 2, label: "Midwest" },
{ val: 3, label: "South" },
{ val: 4, label: "West" },
]
}
- name: Field name
- type: Data type {Numeric, String, DateTime, Boolean, Object, ResultCell, Unknown}
- label: A description of the field
- valueLabels: Descriptions of values that the field may contain
- fields: Sub-fields (if type is Object)
The SPSS file format (*.sav) stores field meta, so a data source imported from an SPSS file may already carry it. Other file formats such as CSV or Excel don't have field meta, and you may need to add it manually.
Stages attempt to preserve field meta wherever possible. See the doc for each Stage to learn if/how each stage modifies field meta.
Field meta can be used by output visualizations, e.g., as column headers, legend labels, etc.
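As a rough illustration of how a visualization might consume field meta, the sketch below derives a column header and legend labels from the example field above. The helper name valueLabelFor and the fallback behaviour (use the raw value when no label exists) are assumptions, not part of any defined API.

```javascript
// Hypothetical helper: look up the display label for a raw value.
// Falls back to the raw value (as a string) when no value label matches.
function valueLabelFor(meta, val) {
  const entry = (meta.valueLabels || []).find((vl) => vl.val === val);
  return entry ? entry.label : String(val);
}

const region = {
  name: "s7",
  type: "Numeric",
  label: "US Region",
  valueLabels: [
    { val: 1, label: "Northeast" },
    { val: 2, label: "Midwest" },
    { val: 3, label: "South" },
    { val: 4, label: "West" },
  ],
};

// Assumed convention: a column header falls back to the field name when there is no label.
const header = region.label || region.name;
const legend = [1, 2, 3, 4].map((v) => valueLabelFor(region, v));
```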
Name (.name)
A unique identifier string of any length.
Label (.label)
A string description of the field.
Data Type (.type)
One of the following values:
- Numeric
- String
- DateTime
- Boolean
- Object
- ResultCell
- Unknown
Value Labels (.valueLabels)
Array of { val, label } entries describing the values that the field may contain.
Child Fields (.fields)
Array of child fields of type Field Meta. These may exist if Data Type is Object.
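The properties above could be checked with a small validator along these lines. This is only a sketch: validateFieldMeta is a hypothetical name, and treating .fields on a non-Object field as a problem is an assumption drawn from the description of Child Fields.

```javascript
const FIELD_TYPES = [
  "Numeric", "String", "DateTime", "Boolean", "Object", "ResultCell", "Unknown",
];

// Hypothetical validator: returns a list of problems, empty if the meta looks well-formed.
function validateFieldMeta(meta) {
  const errors = [];
  if (typeof meta.name !== "string") errors.push("name must be a string");
  if (!FIELD_TYPES.includes(meta.type)) errors.push(`unknown type: ${meta.type}`);
  if (meta.label !== undefined && typeof meta.label !== "string") {
    errors.push("label must be a string");
  }
  if (meta.valueLabels !== undefined) {
    if (!Array.isArray(meta.valueLabels)) {
      errors.push("valueLabels must be an array");
    } else if (meta.valueLabels.some((vl) => typeof vl.label !== "string")) {
      errors.push("every valueLabels entry needs a string label");
    }
  }
  if (meta.fields !== undefined) {
    // Assumption: child fields are only meaningful on Object fields.
    if (meta.type !== "Object") errors.push("fields is only expected when type is Object");
    if (Array.isArray(meta.fields)) {
      // Child fields are themselves Field Meta, so validate them recursively.
      for (const child of meta.fields) {
        for (const e of validateFieldMeta(child)) errors.push(`${child.name}: ${e}`);
      }
    } else {
      errors.push("fields must be an array");
    }
  }
  return errors;
}
```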
Define Raw
Field meta may be defined as follows:
// val can be an int (can it be a double? i don't think so)
{
label: "US Region",
valueLabels: [
{ val: 1, label: "Northeast" },
{ val: 2, label: "Midwest" },
{ val: 3, label: "South" },
{ val: 4, label: "West" },
]
}
Define Raw (string values)
// val can be a string
// I MIGHT NOT ALLOW THIS. SURE SPSS DOES, BUT MAYBE I WON'T.
// SPSS only allows a few chars for val, which doesn't seem amazingly helpful.
// But it could be helpful.
// Allowing strings might create confusion if merging a set of valueLabels with different types (string vs. int)
// I guess in those instances the conflict resolver would need to pick just one.
// ... because I should probably force valueLabels to be EITHER ints or strings, but not both.
// Maybe the SPSS reader could replace the chars with ints, but then how can we ensure
// each file would use the same indexing? Not a good idea.
{
label: "US State",
valueLabels: [
{ val: "AL", label: "Alabama" },
{ val: "AK", label: "Alaska" },
{ val: "AZ", label: "Arizona" },
{ val: "AR", label: "Arkansas" },
...
{ val: "WY", label: "Wyoming" },
]
}
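If valueLabels end up being restricted to EITHER ints or strings, as the notes above suggest, a check could look roughly like this. The function name valueLabelKind and its return values are hypothetical.

```javascript
// Hypothetical check: classify the val types used in a valueLabels array.
// Returns "int", "string", another typeof result, "empty", or "mixed".
function valueLabelKind(valueLabels) {
  const kinds = new Set(
    valueLabels.map((vl) => (Number.isInteger(vl.val) ? "int" : typeof vl.val))
  );
  if (kinds.size === 0) return "empty";
  if (kinds.size > 1) return "mixed";
  return [...kinds][0];
}
```

A reader or a stack stage could reject (or ask the conflict resolver about) any field whose kind comes back as "mixed".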
Load from another field
Actually this is just a placeholder, because loading from another field depends on the context. The possible contexts are:
- When using addFields or Select stage in pipeline (see below)
- When stacking files in a data flow (see below)
{
$fromField: "Q1"
}
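A minimal sketch of how a $fromField placeholder might be resolved once the context supplies the known fields. The resolveFromField function and the fieldsByName lookup are hypothetical, and copying only label and valueLabels is an assumption.

```javascript
// Hypothetical resolver: replaces { $fromField: "<name>" } with meta copied
// from the named field; any other spec is returned unchanged.
function resolveFromField(spec, fieldsByName) {
  const isPlaceholder =
    spec !== null && typeof spec === "object" && !Array.isArray(spec) && "$fromField" in spec;
  if (!isPlaceholder) return spec;
  const source = fieldsByName[spec.$fromField];
  if (!source) throw new Error(`Unknown field: ${spec.$fromField}`);
  // Assumption: only label and valueLabels are inherited.
  return { label: source.label, valueLabels: source.valueLabels };
}

const fieldsByName = {
  Q1: {
    name: "Q1",
    label: "Question about something",
    valueLabels: [{ val: 1, label: "Item one" }],
  },
};
const inherited = resolveFromField({ $fromField: "Q1" }, fieldsByName);
```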
Example in Select Stage
{
$selectFields: [
{ name: "Q1" }, // this will automatically inherit meta from Q1
{ name: "mm", syntax: "month" }, // syntax is smart enough to pull meta from month
{ name: "S3", syntax: "wave < 5 ? null : S3" }, // syntax evaluator pulls meta from S3
{
name: "Q1_rebased",
syntax: "ifnull(Q1,0)",
// this syntax could figure out it should still probably inherit meta from Q1, but should it?
// unless it's something like ifnull(Q1,Q2) -- then it wouldn't know the meta
// i think if syntax is provided, and the syntax isn't a columnExpression,
// do we ask for this?
// it would suck if you have a large bank of vars to rebase
metaFrom: "Q1" // pulls label and valueLabels from Q1
// or, per property:
// label: { $fromField: "Q1" }, // pulls label from Q1
// valueLabels: { $fromField: "Q1" } // pulls valueLabels from Q1
},
{
name: "Q1_rollup",
syntax: "Q1 in (3,4,5) ? 3 : Q1", // this syntax doesn't know how to pull meta
// meta goes here:
label: "Recoded question about something",
valueLabels: [
{ val: 1, label: "Item one" },
{ val: 2, label: "Item two" },
{ val: 3, label: "Item three, four or five" }
]
}
]
}
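The simplest inheritance cases in the example above could be decided like this. It is a sketch only: metaSourceFor is a hypothetical name, and it deliberately handles just an explicit metaFrom, a plain select, and a bare column reference. It does not attempt what the notes describe for expressions such as the S3 ternary.

```javascript
const IDENTIFIER = /^[A-Za-z_][A-Za-z0-9_]*$/;

// Hypothetical helper: returns the name of the field to inherit meta from,
// or null when the meta has to be supplied explicitly.
function metaSourceFor(selectField) {
  if (selectField.metaFrom) return selectField.metaFrom; // explicit instruction wins
  if (selectField.syntax === undefined) return selectField.name; // plain select
  const syntax = selectField.syntax.trim();
  if (IDENTIFIER.test(syntax)) return syntax; // bare column reference, e.g. "month"
  return null; // a general expression: no automatic source in this sketch
}
```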
Stacking files (in data flow)
When files from different waves are stacked, the data flow should check the meta for differences. It won't warn about every difference; it will only warn if there are conflicts.
A conflict is:
- A field label is different
- A value label is different (same value but different label)
A noted discrepancy (not a conflict) is:
- A new or dropped field
- A new value label on an existing field
- A dropped value label on a propagated field
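The two categories above can be sketched as a comparison of two field lists. The function name compareFieldMeta and the shape of the returned records are hypothetical. Values are compared by their string form, and a label present in only one dataset counts as a different label, both of which are assumptions.

```javascript
// Hypothetical comparison of field meta from two datasets (first vs. second).
function compareFieldMeta(fields1, fields2) {
  const conflicts = [];
  const discrepancies = [];
  const byName2 = new Map(fields2.map((f) => [f.name, f]));
  const names1 = new Set(fields1.map((f) => f.name));
  const labelMap = (f) =>
    new Map((f.valueLabels || []).map((vl) => [String(vl.val), vl.label]));

  for (const f1 of fields1) {
    const f2 = byName2.get(f1.name);
    if (!f2) {
      discrepancies.push({ field: f1.name, kind: "droppedField" });
      continue;
    }
    if (f1.label !== f2.label) conflicts.push({ field: f1.name, kind: "label" });
    const labels1 = labelMap(f1);
    const labels2 = labelMap(f2);
    for (const [val, label] of labels1) {
      if (!labels2.has(val)) {
        discrepancies.push({ field: f1.name, kind: "droppedValueLabel", val });
      } else if (labels2.get(val) !== label) {
        conflicts.push({ field: f1.name, kind: "valueLabel", val });
      }
    }
    for (const val of labels2.keys()) {
      if (!labels1.has(val)) discrepancies.push({ field: f1.name, kind: "newValueLabel", val });
    }
  }
  for (const f2 of fields2) {
    if (!names1.has(f2.name)) discrepancies.push({ field: f2.name, kind: "newField" });
  }
  return { conflicts, discrepancies };
}
```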
Should we also perform an inspection and notify the user about discrepancies?
- Maybe if a user wants to manually inspect
How are conflicts resolved?
- When a field label is different:
  - Pick which label to use. The stack stage needs a conflictResolution section that specifies the choice.
- When a value label is different (same value but different label):
  - The user could pick which label to use (in a conflictResolution spec).
  - If the meaning is different, the user should create a recode in advance of the stack stage.
{
$stack: {
ds1: "<datasource 1 id>",
ds2: "<datasource 2 id>",
// metaPick is OPTIONAL.
// It tells the stacker which meta to use and how to resolve conflicts (if any).
// If not provided, the stacker will MERGE meta from both datasets and stop only if there are conflicts.
// Simple form: use the second data source for all meta
metaPick: 2,
// or:
// object layout concept:
metaPick: {
Q1: 1, // take label and valueLabels from first dataset
Q3: {
label: 2 // take label from second dataset
// note: valueLabels will be auto-merged
},
Q4: {
// note: label not affected
valueLabels: 2 // take all valueLabels from second dataset
},
Q5: {
valueLabels: {
"3": 2 // take value label for val 3 from second dataset
}
},
Q10: {
label: { $merge: "{1} / {2}" } // concatenate the two labels
},
Q11: {
valueLabels: {
"3": { $merge: "pre 2022: {1}; 2022+: {2}" } // merge the labels
}
}
},
// array layout concept:
metaMerge: [ // OPTIONAL!
// for Q1, take meta from first dataset
{ $pickField: { field: "Q1", from: 1 } },
// for Q2, use the label from first dataset
{ $pickLabel: { field: "Q2", from: 1 } },
// or, for Q2, concatenate the labels
{ $pickLabel: { field: "Q2", concat: "{1} / {2}" } },
// for Q3's value 7, use the valueLabel from the second dataset
{ $pickValueLabel: { field: "Q3", val: "7", from: 2 } },
// should there exist a fallback?
// i.e. "for all other conflicts, use ..."
// no. because it's not conflict only.
// regardless of conflicts, take meta from second dataset
// (this should be the only metaFlow statement as it would trump all others)
{ $pickAll: { use: 2 } } // NOT USING THIS
]
}
}
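For the object layout, resolving a single label could work roughly as follows. This is a sketch under assumptions: pickLabel is a hypothetical name, the pick argument is whatever metaPick supplies for that label (1, 2, or a $merge template), and with no pick the labels merge only if they agree, otherwise the stacker stops.

```javascript
// Hypothetical resolution of one label given a metaPick entry.
function pickLabel(pick, label1, label2) {
  if (pick === 1) return label1;
  if (pick === 2) return label2;
  if (pick !== null && typeof pick === "object" && typeof pick.$merge === "string") {
    // Substitute {1} and {2} with the labels from the first and second dataset.
    // A replacer function is used so "$" inside a label is never treated specially.
    return pick.$merge.replace(/\{([12])\}/g, (match, n) => (n === "1" ? label1 : label2));
  }
  if (label1 === label2) return label1; // nothing to resolve
  throw new Error(`Label conflict: "${label1}" vs "${label2}"`);
}
```

The same function could be reused for individual value labels, since those are also resolved per value by picking 1, 2, or a $merge template.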
Pipeline Output
A pipeline returns: data, fields, and maybe nestedFields.
Perhaps the fields array (and arrays within nestedFields) could be returned as dicts if requested, in case the author doesn't care about ordering.
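Converting the fields array into a dict keyed by field name could be as small as the sketch below. The function name fieldsToDict is hypothetical, and dropping name from each value (since it becomes the key) is an assumption.

```javascript
// Hypothetical conversion: [{ name, ...meta }] -> { [name]: meta }.
// Ordering is no longer guaranteed to be meaningful in the result.
function fieldsToDict(fields) {
  return Object.fromEntries(fields.map(({ name, ...meta }) => [name, meta]));
}

const dict = fieldsToDict([
  { name: "Q1", label: "Question about something" },
  { name: "Q1_rollup", label: "Recoded question about something" },
]);
```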
fields
Either an array or an object with an .entries array. If it is an object, it could also contain miscellaneous metadata for that dimension (still undecided).
nestedFields
A dict in which each entry is either an array or an object with an .entries array. If an entry is an object, it could contain arbitrary metadata for display purposes.
Each entry item can contain arbitrary metadata.
{
fields: [
{ name: "Q1", label: "Question about something", valueLabels: {...} },
{ name: "Q1_rebased", label: "Question about something", valueLabels: {...} },
{ name: "Q1_rollup", label: "Recoded question about something", valueLabels: {...} },
],
data: [
// rows here
]
}
Pipeline Output Nested Fields (from Aggregation Stage)
{
fields: [ // top level fields
{ name: "label", label: "Some label" },
// from dim1
{ name: "m1", label: "2020 Jan", category: "Month", nest: "segDim" }, // ??
{ name: "m2", label: "2020 Feb", category: "Month", nest: "segDim" }, // ??
{ name: "m3", label: "2020 Mar", category: "Month", nest: "segDim" }, // ??
],
nestedFields: {
segDim: [
// from dim2
{ name: "seg1", label: "Segment 1", category: "Segment", type: "cell" }, // cell has val, n, maybe freq?
{ name: "seg2", label: "Segment 2", category: "Segment", type: "cell" },
],
},
data: [
// first row
{
label: "Product A1000",
m1: {
seg1: { val: 12.34, n: 1000 },
seg2: { val: 12.34, n: 1000 },
},
m2: {
seg1: { val: 12.34, n: 1000 },
seg2: { val: 12.34, n: 1000 },
},
m3: {
seg1: { val: 12.34, n: 1000 },
seg2: { val: 12.34, n: 1000 },
},
},
// next rows go here
]
}
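To show how a consumer might walk this nested output, the sketch below flattens it into one record per cell. The helpers entriesOf and flattenCells are hypothetical. The sketch assumes that a top-level field's nest property names the key in nestedFields, and that fields and nestedFields entries may be either arrays or objects with an .entries array.

```javascript
// Accepts either a plain array or an object with an .entries array.
function entriesOf(fieldsOrWrapper) {
  return Array.isArray(fieldsOrWrapper) ? fieldsOrWrapper : fieldsOrWrapper.entries;
}

// Hypothetical traversal: one record per (row, top-level field, nested field) cell.
function flattenCells(output) {
  const cells = [];
  for (const row of output.data) {
    for (const field of entriesOf(output.fields)) {
      if (!field.nest) continue; // plain top-level field, e.g. "label"
      const subFields = entriesOf(output.nestedFields[field.nest]);
      for (const sub of subFields) {
        const cell = row[field.name] ? row[field.name][sub.name] : undefined;
        if (cell !== undefined) {
          cells.push({ column: field.label, sub: sub.label, val: cell.val, n: cell.n });
        }
      }
    }
  }
  return cells;
}

const sampleOutput = {
  fields: [
    { name: "label", label: "Some label" },
    { name: "m1", label: "2020 Jan", category: "Month", nest: "segDim" },
    { name: "m2", label: "2020 Feb", category: "Month", nest: "segDim" },
  ],
  nestedFields: {
    // Wrapped form, to exercise the "object with an .entries array" variant.
    segDim: {
      entries: [
        { name: "seg1", label: "Segment 1", category: "Segment", type: "cell" },
        { name: "seg2", label: "Segment 2", category: "Segment", type: "cell" },
      ],
    },
  },
  data: [
    {
      label: "Product A1000",
      m1: { seg1: { val: 12.34, n: 1000 }, seg2: { val: 12.34, n: 1000 } },
      m2: { seg1: { val: 12.34, n: 1000 }, seg2: { val: 12.34, n: 1000 } },
    },
  ],
};
const cells = flattenCells(sampleOutput);
```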