Tuesday, March 8, 2011

Re: How to process 1000 files xml to 1 file?

Hi - I saw this post just now.

   I googled for yacc xml grammar. This url has a yacc grammar for download:

    http://www.w3.org/XML/9707/XML-in-C

   I downloaded the tar - it has 3 files: scanner.l, parser.y, main.c

   There is no make file - but compilation is simple:

     > yacc -d parser.y
     > flex scanner.l
     > gcc y.tab.c lex.yy.c main.c

I created a file named "input":

<Patient>
  <lbpa_Npa>1234</lbpa_Npa>
  <lbpa_Nai>02-Oct-1923 00:00:00</lbpa_Nai>
  <Entree>15-Oct-1582 01:00:00</Entree>
  <Pid>0</Pid>
  <Ncas>0</Ncas>
  <pre10 />
  <lbpa_Pre>Peter</lbpa_Pre>
  <lbpa_Num_Npat>1234567</lbpa_Num_Npat>
  <lbrq_Nom1 />
  <lbpa_Adr2>Paris</lbpa_Adr2>
  <lbrq_Nom2 />
  <lbpa_Sexe>M</lbpa_Sexe>
  <nom10 />
  <lbrq_Rid>0</lbrq_Rid>
  <Actif />
  <lbpa_Adr />
  <lbpa_Nom>Smith</lbpa_Nom>
  <Adm />
</Patient>

then did  ./a.out < input

and the parser split it into tokens.


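However, an input file like this does not parse, presumably because it has two top-level elements while a well-formed XML document has exactly one root element: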

<Patient>
  <lbpa_Npa>1234</lbpa_Npa>
  <lbpa_Nai>02-Oct-1923 00:00:00</lbpa_Nai>
  <Entree>15-Oct-1582 01:00:00</Entree>
  <Pid>0</Pid>
  <Ncas>0</Ncas>
  <pre10 />
  <lbpa_Pre>Peter</lbpa_Pre>
  <lbpa_Num_Npat>1234567</lbpa_Num_Npat>
  <lbrq_Nom1 />
  <lbpa_Adr2>Paris</lbpa_Adr2>
  <lbrq_Nom2 />
  <lbpa_Sexe>M</lbpa_Sexe>
  <nom10 />
  <lbrq_Rid>0</lbrq_Rid>
  <Actif />
  <lbpa_Adr />
  <lbpa_Nom>Smith</lbpa_Nom>
  <Adm />
</Patient>
<Demande>
  <Entree>15-Oct-1582 01:00:00</Entree>
  <lbde_Rid>12345</lbde_Rid>
  <lbde_Nlab>12345</lbde_Nlab>
  <Sortie>15-Oct-1582 01:00:00</Sortie>
  <NarunaFile />
  <Ncas>0</Ncas>
  <Etabl />
  <lbde_Num_Npat>12345</lbde_Num_Npat>
  <Naruna />
  <Date_Mod>01-Jan-1900 00:00:00</Date_Mod>
  <Taille>0</Taille>
  <lbde_pid>12345/111</lbde_pid>
  <TCollection>0</TCollection>
  <Semgr>0</Semgr>
  <lbrq_nom1 />
  <lbde_Dtprv>02-Mar-2011 06:00:00</lbde_Dtprv>
  <Pathologique>FALSE</Pathologique>
  <lbrq_nom2 />
  <Bacterio>FALSE</Bacterio>
  <Volume>0</Volume>
  <Type_www />
  <Poids>0</Poids>
  <lbde_Dtdem>02-Mar-2011 07:18:32</lbde_Dtdem>
  <PasVue>FALSE</PasVue>
  <par />
  <Domaine />
</Demande>

So maybe you need to put the "Demande" and "Analyse" data into separate files (using a vim macro, perhaps?) and associate them with the original, like a join on some key (patient id?).
For example: rapport_33405954_Patient.xml, rapport_33405954_Demande.xml, rapport_33405954_Analyse.xml. Then call the yacc parser three times, once on each file, and modify the parser to generate a data file with SPSS labels.
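The splitting step does not have to be a vim macro. Here is a minimal sketch in C++, assuming each section appears as a plain <Patient>, <Demande> or <Analyse> block without attributes and is never nested inside an element of the same name. extract_blocks is a hypothetical helper, not part of the downloaded parser:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns every <tag>...</tag> block found in xml,
// in document order. Assumes the opening tag has no attributes and that
// blocks of the same name are not nested.
std::vector<std::string> extract_blocks(const std::string& xml,
                                        const std::string& tag) {
    const std::string open = "<" + tag + ">";
    const std::string close = "</" + tag + ">";
    std::vector<std::string> blocks;
    std::size_t pos = 0;
    while (true) {
        const std::size_t begin = xml.find(open, pos);
        if (begin == std::string::npos) break;
        const std::size_t end = xml.find(close, begin + open.size());
        if (end == std::string::npos) break;  // unterminated block: stop
        blocks.push_back(xml.substr(begin, end + close.size() - begin));
        pos = end + close.size();
    }
    return blocks;
}
```

Each returned block holds a single root element, so it can be written to its own file and handed to the parser.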

Code skeleton ...

#include <cstdio>
#include <string>
#include <vector>

// Assumption: y.tab.c and lex.yy.c are compiled as C, as in the gcc line above.
extern "C" {
    extern FILE *yyin;         // flex input stream
    int yyparse(void);         // generated by yacc
    void yyrestart(FILE *);    // generated by flex
}

int patient_key = 0;  // set by the modified parser while reading the Patient file

// Run the parser on one file; returns false if the file cannot be opened.
static bool parse_file(const std::string &name) {
    yyin = fopen(name.c_str(), "rb");
    if (!yyin) return false;
    yyrestart(yyin);
    yyparse();  // in the parser, dump the required data to some safe place using patient_key
    fclose(yyin);
    return true;
}

// bases: names of the original files without ".xml", e.g. "rapport_33405954".
// Do not include the generated _Patient, _Demande, _Analyse files - just the originals.
void process_all(const std::vector<std::string> &bases) {
    for (const std::string &base : bases) {
        parse_file(base + "_Patient.xml");  // at this point patient_key has been set
        parse_file(base + "_Demande.xml");
        parse_file(base + "_Analyse.xml");

        // you have everything that you need in the safe place - output spss labels
        // output spss data
    }
}
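
If the loop starts from the full file names (rapport_33405954.xml), the suffix has to go in before the extension. Here is a minimal sketch, where with_suffix is a hypothetical helper name:

```cpp
#include <string>

// Hypothetical helper: inserts suffix before a trailing ".xml", e.g.
// "rapport_33405954.xml" + "_Patient" -> "rapport_33405954_Patient.xml".
// If the name does not end in ".xml", the suffix is simply appended.
std::string with_suffix(const std::string& filename, const std::string& suffix) {
    const std::string ext = ".xml";
    if (filename.size() >= ext.size() &&
        filename.compare(filename.size() - ext.size(), ext.size(), ext) == 0) {
        return filename.substr(0, filename.size() - ext.size()) + suffix + ext;
    }
    return filename + suffix;
}
```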

> Each blood test has a variable number of components

Since you mentioned a "variable number of components", a parser is a good bet.

Bests,
Neil
   
  

On Tue, Mar 8, 2011 at 12:34 AM, AMDx64BT <amdx64bt@gmail.com> wrote:

-- DESCRIPTION -----------------------

I would like to write a script with awk or vim to process Lab Blood Tests in xml format to import with SPSS.

Each blood test is an xml file.

If I have (for example):
1000 Lab Blood Tests (1000 xml files)
250 patients
4 blood tests/patient

Each Blood Test file has the name format: rapport_33405954.xml

Each blood test has a variable number of components, but I am interested in analyzing only 3 elements: K, Na and Ca. (These elements are not included in all the Blood Tests.)


-- PATIENT -----------------------

-- SOURCE:

<Patient>
  <lbpa_Npa>1234</lbpa_Npa>
  <lbpa_Nai>02-Oct-1923 00:00:00</lbpa_Nai>
  <Entree>15-Oct-1582 01:00:00</Entree>
  <Pid>0</Pid>
  <Ncas>0</Ncas>
  <pre10 />
  <lbpa_Pre>Peter</lbpa_Pre>
  <lbpa_Num_Npat>1234567</lbpa_Num_Npat>
  <lbrq_Nom1 />
  <lbpa_Adr2>Paris</lbpa_Adr2>
  <lbrq_Nom2 />
  <lbpa_Sexe>M</lbpa_Sexe>
  <nom10 />
  <lbrq_Rid>0</lbrq_Rid>
  <Actif />
  <lbpa_Adr />
  <lbpa_Nom>Smith</lbpa_Nom>
  <Adm />
</Patient>

-- RESULT:

(last_name first_name, date_of_birth)
lbpa_Nom lbpa_Pre, lbpa_Nai
Smith Peter, 1923.10.02
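
The date conversion shown in this result can be sketched in C++ as follows. It assumes the stamps always have the form DD-Mon-YYYY with English month abbreviations, as in the samples; to_ymd is a hypothetical helper name:

```cpp
#include <array>
#include <cstddef>
#include <string>

// Hypothetical helper: "02-Oct-1923 00:00:00" -> "1923.10.02".
// Returns an empty string if the input does not look like DD-Mon-YYYY.
std::string to_ymd(const std::string& stamp) {
    static const std::array<const char*, 12> months = {
        "Jan", "Feb", "Mar", "Apr", "May", "Jun",
        "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"};
    if (stamp.size() < 11 || stamp[2] != '-' || stamp[6] != '-') return "";
    const std::string day = stamp.substr(0, 2);
    const std::string mon = stamp.substr(3, 3);
    const std::string year = stamp.substr(7, 4);
    for (std::size_t i = 0; i < months.size(); ++i) {
        if (mon == months[i]) {
            // Zero-pad the month number to two digits.
            const std::string mm = (i + 1 < 10 ? "0" : "") + std::to_string(i + 1);
            return year + "." + mm + "." + day;
        }
    }
    return "";  // unknown month abbreviation
}
```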


-- DATE TAKEN BLOOD -----------------------

-- SOURCE:

<Demande>
  <Entree>15-Oct-1582 01:00:00</Entree>
  <lbde_Rid>12345</lbde_Rid>
  <lbde_Nlab>12345</lbde_Nlab>
  <Sortie>15-Oct-1582 01:00:00</Sortie>
  <NarunaFile />
  <Ncas>0</Ncas>
  <Etabl />
  <lbde_Num_Npat>12345</lbde_Num_Npat>
  <Naruna />
  <Date_Mod>01-Jan-1900 00:00:00</Date_Mod>
  <Taille>0</Taille>
  <lbde_pid>12345/111</lbde_pid>
  <TCollection>0</TCollection>
  <Semgr>0</Semgr>
  <lbrq_nom1 />
  <lbde_Dtprv>02-Mar-2011 06:00:00</lbde_Dtprv>
  <Pathologique>FALSE</Pathologique>
  <lbrq_nom2 />
  <Bacterio>FALSE</Bacterio>
  <Volume>0</Volume>
  <Type_www />
  <Poids>0</Poids>
  <lbde_Dtdem>02-Mar-2011 07:18:32</lbde_Dtdem>
  <PasVue>FALSE</PasVue>
  <par />
  <Domaine />
</Demande>

-- RESULT:

(Date_taken_blood)
lbde_Dtprv
2011.03.02


-- ELEMENT -----------------------

-- SOURCE:

<Analyse>
  <OrdreImpression>12345</OrdreImpression>
  <CodeMateriel />
  <TypeLigne>0</TypeLigne>
  <Formulaire>21</Formulaire>
  <Norme>136 - 145 mmol/l</Norme>
  <Code>2039</Code>
  <Commentaire />
  <Anterieur />
  <TypeResultat>0</TypeResultat>
  <Resultat>136</Resultat>
  <Unite>mmol/l</Unite>
  <Remarque />
  <Clos>O</Clos>
  <Libelle>Sodium</Libelle>
</Analyse>

-- RESULT:

(Element number)
Libelle Resultat
Sodium 136
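
Pulling Libelle and Resultat out of one <Analyse> block can be sketched in C++ as follows. It assumes the tags carry no attributes and each occurs at most once per block; tag_value is a hypothetical helper name:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the text between <tag> and </tag> in block,
// or an empty string if the tag is absent or self-closed (e.g. <Remarque />).
std::string tag_value(const std::string& block, const std::string& tag) {
    const std::string open = "<" + tag + ">";
    const std::string close = "</" + tag + ">";
    const std::size_t begin = block.find(open);
    if (begin == std::string::npos) return "";
    const std::size_t start = begin + open.size();
    const std::size_t end = block.find(close, start);
    if (end == std::string::npos) return "";
    return block.substr(start, end - start);
}
```

For the block above, tag_value(block, "Libelle") and tag_value(block, "Resultat") give the two fields of the result line.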


-- SORT ELEMENT BY DATE -----------------------

Sodium 05.01.2011 --> Na1
Sodium 08.01.2011 --> Na3
Sodium 06.01.2011 --> Na2
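
This numbering can be sketched in C++ as follows. It assumes each date has already been turned into a sortable key such as "20110105" (YYYYMMDD); Measurement and label_by_date are hypothetical names:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// One measurement of a single element for a single patient.
struct Measurement {
    std::string date_key;  // sortable date, e.g. "20110105" (YYYYMMDD)
    std::string result;    // content of <Resultat>
};

// Hypothetical helper: orders the measurements by date and pairs each
// result with a label built from symbol and its rank: Na1, Na2, ...
std::vector<std::pair<std::string, std::string>>
label_by_date(std::vector<Measurement> items, const std::string& symbol) {
    std::stable_sort(items.begin(), items.end(),
                     [](const Measurement& a, const Measurement& b) {
                         return a.date_key < b.date_key;
                     });
    std::vector<std::pair<std::string, std::string>> labelled;
    for (std::size_t i = 0; i < items.size(); ++i) {
        labelled.emplace_back(symbol + std::to_string(i + 1), items[i].result);
    }
    return labelled;
}
```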


-- FINAL RESULT -----------------------

From the 1000 files I want to obtain a single file in this format, so that I can import it with SPSS:

                          Na1    Na2    K1     K2
Smith Peter  19231002     136    133    4      3.5
Gates Edward 19801204     145    166    3.1    3.4

(In this case the date of Na1 could be different for Smith and for Gates, but the variable Na1 is the same.)


Any advice is appreciated

--
You received this message from the "vim_use" maillist.
Do not top-post! Type your reply below the text you are replying to.
For more information, visit http://www.vim.org/maillist.php
