Question on input data format #2

@poorboy44

example.sh describes the input format as:

This file provides information about running the Dynamic Topic Model
or the Document Influence Model.  It gives two command-line examples
for running the software and several example commands in R for reading
output files.

Dynamic topic models and the influence model have been implemented
here in c / c++.  This implementation takes two input files:

 (a) foo-mult.dat, which is one-doc-per-line, each line of the form

   unique_word_count index1:count1 index2:count2 ... indexn:countn

   where each index is an integer corresponding to a unique word.

 (b) foo-seq.dat, which is of the form

   Number_Timestamps
   number_docs_time_1
   ...
   number_docs_time_i
   ...
   number_docs_time_NumberTimestamps

   - The docs in foo-mult.dat should be ordered by date, with the first
     docs from time1, the next from time2, ..., and the last docs from
     timen.

test-mult.dat looks like this (1000 lines):

28 12:1 44:1 75:10 76:1 77:1 78:1 79:1 80:1 81:2 82:1 83:1 84:1 85:1 86:1 87:2 88:4 89:1 90:1 91:1 92:1 93:1 94:1 95:1 96:1 97:1 98:1 99:2 100:1
60 771:1 388:1 98:1 134:1 8:1 908:1 1037:1 600:1 405:1 1046:1 516:1 27:2 773:1 37:1 1137:1 1138:1 302:1 433:2 51:1 59:1 999:1 1119:1 224:1 67:1 69:1 71:1 584:1 330:1 77:1 269:1 337:1 83:1 1112:1 349:2 1118:1 1125:1 1120:1 1121:1 1122:1 1123:1 1124:1 101:2 1126:1 1127:1 488:1 1129:1 618:3 1131:1 1132:1 1133:1 1134:1 1135:1 1136:2 1128:1 114:4 1139:1 1140:1 1141:1 631:1 1130:1
17 257:1 546:2 547:1 548:1 549:1 6:1 551:1 552:1 553:1 554:1 550:2 418:1 174:1 433:1 315:1 92:1 415:1
11 288:1 1248:1 5:1 1063:2 269:1 654:1 656:2 532:1 373:1 1247:1 543:1
25 909:1 407:1 797:1 543:1 555:1 693:1 823:4 569:1 1226:1 1227:1 1228:2 1229:1 1230:1 1231:4 1232:1 1233:4 1234:1 1235:1 1236:1 1237:3 1238:1 1239:1 1106:1 113:1 243:1
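
A minimal sketch of how one such line could be parsed, assuming the description quoted from example.sh is complete. The helper name `parse_mult_line` is hypothetical and not part of the DTM code; it only checks that the leading number matches the number of `index:count` pairs.

```python
def parse_mult_line(line):
    """Parse 'unique_word_count index1:count1 ...' into {word_index: count}."""
    fields = line.split()
    declared = int(fields[0])          # leading unique_word_count
    counts = {}
    for pair in fields[1:]:
        index, count = pair.split(":")
        counts[int(index)] = int(count)
    # Assumption: the leading number equals the number of distinct indices.
    if declared != len(counts):
        raise ValueError(
            f"declared {declared} unique words, found {len(counts)}"
        )
    return counts


# Fourth sample line from test-mult.dat above.
sample = "11 288:1 1248:1 5:1 1063:2 269:1 654:1 656:2 532:1 373:1 1247:1 543:1"
doc = parse_mult_line(sample)
print(len(doc), sum(doc.values()))  # prints: 11 13
```

Here 11 is the number of distinct words in the document and 13 is its total token count.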

test-seq.dat looks like this (10 lines):

10
25
50
75
100
100
100
100

I don't understand how the time correspondence is defined between test-mult.dat (which has one document per line) and test-seq.dat (which has the number of docs per time period, in this case 10 time periods). Can someone clarify how the input data should be formatted? Are we assuming that the first 10 documents in test-mult.dat correspond to time period 1, the next 25 documents to time period 2, and so on?
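
For reference, a sketch of one literal reading of the foo-seq.dat description quoted above: the first line is Number_Timestamps and each following line is the document count for one time period, so in test-seq.dat the leading 10 would be the number of periods and 25 the size of period 1. This is an assumption drawn from the quoted text, not confirmed behaviour of the DTM code. The function name `time_slices` and the three-period input are hypothetical.

```python
def time_slices(seq_text):
    """Map each time period to a half-open range of 0-based line numbers
    in the -mult.dat file, assuming the first value is the period count."""
    values = [int(tok) for tok in seq_text.split()]
    n_times, doc_counts = values[0], values[1:]
    if n_times != len(doc_counts):
        raise ValueError(
            f"header says {n_times} periods, found {len(doc_counts)} counts"
        )
    slices = []
    start = 0
    for n_docs in doc_counts:
        slices.append((start, start + n_docs))  # docs [start, start + n_docs)
        start += n_docs
    return slices


# Hypothetical seq file: 3 periods holding 2, 1 and 4 documents.
for period, (lo, hi) in enumerate(time_slices("3\n2\n1\n4"), start=1):
    print(f"time {period}: mult lines {lo}..{hi - 1}")
# prints:
# time 1: mult lines 0..1
# time 2: mult lines 2..2
# time 3: mult lines 3..6
```

Under this reading the counts should sum to the number of lines in the -mult.dat file, which gives a quick consistency check on a pair of input files.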
